ABSTRACT


Fault diagnosis in large-scale systems that are products of modern technology presents
formidable challenges to manufacturers and users. This is due to the large number of failure
sources in such systems and the need to quickly isolate and rectify failures with minimal down 
time. In addition, for fault-tolerant systems and systems with infrequent opportunity for 
maintenance (e.g., Hubble telescope, space station), the assumption of at most a single fault in 
the system is unrealistic. In this project, we have developed novel block and sequential 
diagnostic strategies to isolate multiple faults in the shortest possible time without making the 
unrealistic single fault assumption.

1. EXECUTIVE SUMMARY


1.1 Problem Definition and Significance 


Diagnosis is the process of identifying the cause of a malfunction by observing its effects at 
various monitoring/test points in a system. As technology advances, there is a significant 
increase in the complexity and sophistication of systems. Moreover, integration and 
miniaturization have sharply limited access to test points. Thus, the number of failure sources
has increased, while the reduction in monitoring points has resulted in reduced fault observability,
making it increasingly difficult to troubleshoot these systems. Consequently, system 
maintenance presents formidable challenges to manufacturers and users. In this vein, computer- 
aided design techniques for system modeling and computational algorithms for test sequencing 
are of paramount significance. This research has developed novel multiple fault diagnosis 
algorithms to directly address this vital need. 


Maintenance and design have traditionally been two separate engineering disciplines with
often conflicting objectives: maximizing ease of maintenance versus optimizing performance, 
size and cost. Testability analysis has been an ad hoc, manual effort, in which maintenance 
engineers attempt to identify an efficient method of troubleshooting for the given product, with 
little or no control over product design. Testability deficiencies in the design cannot therefore
be rectified. This adversely impacts the life-cycle cost. It is now widely recognized that 
testability must be engineered into the product at the design stage itself, so that an optimal 
compromise is achieved between system maintainability and performance. This process of 
refining a system design to improve testability is termed Design for Testability (DFT), and is 
now a requirement in most complex system development projects. 

Our previous research has developed multi-signal directed graph modeling techniques 
that enable the representation of a system either top-down (as lower-level details become 
available), bottom-up (for system integration tasks) or a combination of both. In addition, we 
have devised test sequencing algorithms to analyze the testability of a system design, and to 
determine a near optimal sequence of tests for diagnosing single faults in hierarchical systems. 
A solution to the test sequencing problem is a decision tree, which specifies the test to perform 
next depending on the outcomes of previously applied tests. A novel feature of our approach is 
the integration of concepts from information theory and AND/OR graph search techniques to 
overcome the computational explosion of the optimal test sequencing problem [1].
Furthermore, the top-down nature of the search algorithms has enabled us to derive a variety
of near-optimal and practical diagnostic strategies that provide a tradeoff between the degree of 
suboptimality and computational complexity [1-11]. The resulting algorithms have 
demonstrated their utility on large hierarchical systems: Boeing-Sikorsky has employed our 
algorithms on a flight-control system model with 8 levels of hierarchy and 10,000 faults and test 
points; we have generated the troubleshooting strategies of a space shuttle main propulsion 
system with 7000 failure sources and a similar number of test points (using a digraph model
provided by NASA-Ames) in only 1.5 hours on a SPARCstation 10. In this effort, we have
extended the single-fault diagnostic strategies to situations where multiple faults may be present. 


1.2 Research Results


1.2.1 Sequential Algorithms for Multiple Fault Diagnosis 

As part of our effort on multiple fault diagnosis, we investigated the problem of 
constructing near-optimal test sequencing algorithms for diagnosing multiple faults in complex 
systems. The computational complexity of solving the optimal multiple-fault isolation problem 
is super-exponential, that is, it is much more difficult than the single-fault isolation problem [1], 
which, by itself, is exponential. By employing concepts from information theory and Lagrangian
relaxation, we developed several static and dynamic (on-line or interactive) test sequencing 
algorithms for the multiple fault isolation problem that provide a tradeoff between the degree of 
suboptimality and computational complexity. Furthermore, we derived novel diagnostic 
strategies that generate a static diagnostic directed graph (digraph), instead of a static diagnostic 
tree, for multiple fault diagnosis. Using this approach, the storage complexity of the overall 
diagnostic strategy reduces substantially. Computational results based on real-world systems 
from Sikorsky Aircraft indicate that the size of a static multiple fault strategy is strictly related to
the structure of the system, and that the use of an on-line multiple fault strategy can diagnose 
faults in systems with as many as 10,000 failure sources. The details on sequential multiple fault 
strategies may be found in the following references: 

12. Shakeri, M., Pattipati, K., Raghavan, V., Patterson-Hine, A., and Kell, T.,
“Sequential Test Strategies for Multiple Fault Isolation,” 1995 IEEE AUTOTESTCON,
Atlanta, GA, Aug. 1995.

13. Shakeri, M., Pattipati, K., Raghavan, V., Patterson-Hine, A., and Iverson,
D.L., “Multiple Fault Isolation in Redundant Systems,” 1995 IEEE International
Conference on Systems, Man and Cybernetics, Vancouver, BC, October 1995.

14. Shakeri, M., Raghavan, V., Pattipati, K., and Patterson-Hine, A., “Sequential 
Testing Algorithms for Multiple Fault Isolation,” submitted to IEEE Transactions on 
Systems, Man and Cybernetics, August 1996. 

1.2.2 Fault Diagnosis with Imperfect Tests

We investigated two fault diagnosis problems for the case when tests are imperfect: (1)
sequential fault diagnosis under a single fault assumption; and (2) fault diagnosis when all test
results are available as a block. 

When tests are imperfect, the test sequencing problem corresponds to a partially observed 
Markov decision problem (POMDP), a sequential multi-stage decision problem wherein the 
states are the set of possible failure sources and information regarding the states is obtained via
the results of imperfect tests. The optimal solution for this problem was obtained by applying a
continuous state dynamic programming (DP) recursion. However, the DP recursion is 
computationally very expensive owing to the continuous nature of the state vector comprising 
the probabilities of faults. In order to alleviate the computational explosion, we developed an 
efficient implementation of the DP recursion. We also considered various problems with special 
structure (e.g., parallel systems) and derived closed-form solution/index-rules without having to 
resort to DP. Finally, we developed a variety of top-down graph search algorithms for 
problems with no special structure, including multi-step DP, multi-step information heuristics 
and certainty equivalence algorithms. We compared these near-optimal algorithms with DP for 
small problems to gauge their effectiveness. The details on test sequencing with unreliable tests 
may be found in the following reference: 

15. Raghavan, V., Shakeri, M., and Pattipati, K., “Test Sequencing Algorithms with 
Unreliable Tests,” submitted to IEEE Transactions on Systems, Man and Cybernetics,
August 1996. 

Next, we considered the problem of constructing optimal and near-optimal algorithms for multiple fault
diagnosis (MFD) in bipartite systems (i.e., systems with failure sources connected directly with
tests) with unreliable tests. It is known that exact computation of conditional probabilities for
multiple fault diagnosis is NP-hard. The novel feature of our diagnostic algorithms was the use
of Lagrangian relaxation and subgradient optimization methods to provide: (1) near-optimal
solutions for the MFD problem, and (2) upper bounds for an optimal branch-and-bound
algorithm. The proposed method was illustrated using several medical diagnosis examples.
Computational results indicated that: (1) our algorithm has superior computational performance
to the existing algorithms (approximately three orders of magnitude improvement over the
algorithms in the artificial intelligence literature); (2) the near-optimal algorithm generates the
most likely candidates with very high accuracy; and (3) our algorithm can find the most likely 
candidates in systems with as many as 1000 faults. The details of the algorithm may be found in 
the following references: 

16. Shakeri, M., Raghavan, V., Pattipati, K., and Patterson-Hine, A., “Optimal and 
Near-optimal Algorithms for Multiple Fault Diagnosis with Unreliable Tests,” 1996 
IEEE AUTOTEST Conference, Dayton, OH, September 1996.

17. Shakeri, M., Raghavan, V., Pattipati, K., and Patterson-Hine, A., “Algorithms for 
Multiple Fault Diagnosis with Unreliable Tests,” submitted to IEEE Transactions on 
Systems, Man and Cybernetics, August 1996.

Abstract

In this paper, we consider the problem of constructing near- 
optimal test sequencing algorithms for diagnosing multiple 
faults in redundant (fault-tolerant) systems. The computa- 
tional complexity of solving the optimal multiple-fault isola- 
tion problem is super-exponential, that is, it is much more 
difficult than the single-fault isolation problem, which, by itself,
is NP-hard¹ [1]. By employing concepts from information
theory and Lagrangian relaxation, we present several static 
and dynamic (on-line or interactive) test sequencing algo- 
rithms for the multiple fault isolation problem that provide a 
trade-off between the degree of suboptimality and computa- 
tional complexity. Furthermore, we present novel diagnostic 
strategies that generate a static diagnostic directed graph (di- 
graph), instead of a static diagnostic tree, for multiple fault 
diagnosis. Using this approach, the storage complexity of the 
overall diagnostic strategy reduces substantially. Computa- 
tional results based on real-world systems indicate that the 
size of a static multiple fault strategy is strictly related to 
the structure of the system, and that the use of an on-line 
multiple fault strategy can diagnose faults in systems with as 
many as 10,000 failure sources. 

1 Introduction 

The complexity associated with the maintenance of large
integrated systems, such as the space shuttle or a modern
aircraft consisting of mechanical, electro-mechanical and
hydraulic subsystems, presents formidable challenges to
manufacturers and end users. This is due to the large
number of failure sources and the need to quickly isolate
and rectify such failures with minimal down time.
In addition, for redundant (fault-tolerant) systems and
for systems with little or no opportunity for repair or
maintenance during their operation (e.g., Hubble telescope,
space station), the assumption of at most a single
failure in the system between consecutive maintenance
actions is unrealistic. Thus, the efficient maintenance of
complex redundant systems requires advanced diagnostic
algorithms for multiple fault isolation. This paper considers
the problem of constructing efficient algorithms for
diagnosing multiple faults in systems with and without
redundancy.

* Research supported in part by the Department of Economic
Development of the State of Connecticut, NASA-Ames Research
Center, Sikorsky Aircraft and Qualtech Systems, Inc.

¹ This means that the computational requirements of an optimal
algorithm cannot be bounded by a polynomial function of the
number of failure sources and/or the number of tests.

For diagnostic purposes, we only need to model how 
a failure (or cause) propagates to the various monitoring 
points. Consequently, it is sufficient to model the system 
in its failure space. That is, the model does not describe 
how the system normally performs, but how the various 
failure sources manifest themselves as malfunctions. The 
failure propagation is modeled in the form of first-order 
cause-effect relationships using digraph techniques. The 
fundamental premise of digraph techniques is that the 
cause-effect linkages must connect the fault origin to the 
observed symptoms of the fault. The digraph models 
encompass a variety of modeling approaches, including 
dependency models [32], signed directed graphs [33], and 
fault trees [34]. 

Once a system is described in terms of a digraph model, 
the full order dependencies among failure sources and 
tests can be captured by a binary test matrix B, consist- 
ing of the failure sources as row indices and the tests as 
column indices [12]. This binary test matrix can be used 
to diagnose single faults, as well as multiple faults in systems
having no redundancy. This assertion is based on
the assumption that the failure sources are independent 
and, consequently, the failure signature of a multiple fail- 
ure is the union of failure signatures of individual failure 
sources. However, this property is not valid for systems 
with redundancy, even under the assumption of failure 
independence. The single faults and minimal faults, i.e., 
minimum number of faults with a failure signature dif- 
ferent from the union of failure signatures of individual 
faults, together with their failure signatures, constitute 
the necessary information for fault diagnosis in redun- 
dant systems. Thus, the problem of generating a binary 
test matrix in redundant systems reduces to the prob- 
lem of finding minimal faults of a digraph model. After 
generating the binary test matrix, the problem is to de- 
sign a sequential testing strategy for diagnosing multiple 
faults. Thus, multiple fault diagnosis involves two se- 
quential steps: (1) generation of a binary test matrix, 
which contains all the necessary information for single- 
fault and multiple-fault diagnosis, and (2) design of a 
multiple-fault testing strategy that unambiguously iso- 
lates the failure sources with minimum expected testing 
cost (or time).

The problem of finding minimal faults in digraph mod- 
els is much more difficult than that in the fault tree mod- 
els, which, by itself, is NP-hard [2]. This is because 
a fault tree model contains no cycles (feedback loops), 
and because there exists only one target event, for which 
the minimal faults (cuts) should be computed. Rauzy
[2] considered the problem of computing minimal faults 
(cuts) of fault tree models, and presented an efficient 
method to compute them using binary decision diagrams. 
Vatn [3] presented a method for the identification of min- 
imal cut sets in a fault tree. The cut sets are stored in 
a virtual tree structure. In this method, by traversing 
the virtual tree, minimal cuts of size one are identified 
first. Then, in the second iteration, all minimal cuts of 
size two are identified and compared with the cut sets of 
size one to exclude non-minimal cuts. This procedure is 
continued until all minimal cuts are identified. 

Since the number of minimal cuts can increase expo- 
nentially with the size of the tree, it is practical to trun- 
cate the computation by neglecting higher order and/or 
low-probability faults. Brown [4, 5] presented an algo- 
rithm that uses probability-based truncation, and deter- 
mines a rigorous upper bound on each event-probability 
by propagating the effect of all the truncated cut sets in 
the form of numeric residuals. Iverson and Patterson- 
Hine [6] considered the problem of generating singletons 
(single fault) and doubletons (double faults) in digraph 
models. A major contribution of this paper is the de- 
velopment of a top-down recursive algorithm that finds 
all the minimal faults in digraph models, and an efficient
bottom-up algorithm that finds minimal faults up to a
limited size. The failure signatures of minimal faults are 
generated thereafter, and the single-fault binary test ma- 
trix is augmented to include this information. 

Davis [7, 8] described a fault diagnosis system that reasons
from the knowledge of structure and behavior. Failure
candidate generation in this approach occurs in three
basic steps: circuit simulation and discrepancy collection,
potential candidate determination, and global consistency
determination using constraint suspension techniques.
However, for multiple fault diagnosis, this approach
suffers from severe computational explosion. De
Kleer and Williams [9] presented a model-based approach
to fault diagnosis. By keeping track of multiple sets of 
consistent and inconsistent components, their algorithm 
generates minimal sets of faulty candidates rather than 
generating all possible candidates. This approach re- 
quires the complete specification of system components, 
the state and observed variables associated with each 
component, and the functional relationships among the 
state variables. However, the precise information re- 
quired by these models is typically not available for com- 
plex systems and is too costly to obtain. In addition, 
because of extensive use of functional simulation, this ap- 
proach is extremely slow, and, thus, is not appropriate for 
fault diagnosis in large scale systems with the complexi- 
ties of many orders of magnitude more than the examples 
presented in [9]. Sheppard and Simpson [35] provided 
a formal analysis of the multiple failure problem in the 
context of the information flow model. They discussed the
computational complexity of several algorithms for diagnosing
multiple failures, and developed algorithms to
generate multiple fault diagnoses for a given ambiguity
group. However, this method does not take into account 
the failure probabilities of components, test costs, or sys- 
tem redundancies. 

In this paper, we first extend the single-fault strategy 
of our previous work [1, 10, 12, 28] to diagnose multi- 
ple faults by successive replacement of single fault candi- 
dates. Using this strategy, we seek to isolate the poten- 
tial single-fault candidates, then double-fault candidates, 
and so on. Since a component may be repaired/replaced 
before confirming that it is indeed faulty, the probability 
of false alarm error or RTOK (retest OK) is higher than 
that with multiple fault strategies that use all informa- 
tive tests before repairing a component in the system. 

Next, we focus on developing a class of Sure strate- 
gies [11] for diagnosing multiple faults in digraph models 
that employ all informative tests before diagnosis. The 
basic idea of these strategies is to find one or more defi- 
nitely failed components, while not making an error when 
other co-existing faults are present. Furthermore, in order
to eliminate the problems associated with the storage
of the complete diagnostic strategy, an interactive
testing strategy has been implemented. Instead of gen- 
erating the entire diagnostic tree, the interactive testing 
strategy suggests the next test to be applied, given the 
outcomes of previously applied tests, and generates the 
path leading to the isolation of multiple failures in a sys- 
tem. We employ concepts from information theory and 
Lagrangian relaxation to generate several on-line diag- 
nostic strategies. Using these strategies, we can diagnose 
multiple faults in large systems with as many as 10,000 
failure sources. 

2 Problem Formulation 

We assume that the system is modeled by the digraph 
DG = {S,T,A,E}, where E denotes the set of directed 
edges specifying the functional information flow in the 
system and 

• S = {s_1, ..., s_m} is a finite set of independent failure
sources (failure aspects) associated with the system;

• T = {t_1, ..., t_n} is a finite set of n available binary
outcome tests, where the integrity of system failure
sources/components/modules can be ascertained;

• A = {a_1, ..., a_K} is a finite set of AND nodes representing
system redundancies.

The input requirements of the various nodes of the
directed graph are as follows:

1. Failure node: Unconditional a priori probability vector
of failure nodes P = [p(s_1), ..., p(s_m)], where
p(s_i) is the a priori probability of failure source s_i.

2. Test node: A set of test costs C = {c_1, c_2, ..., c_n},
where c_j is the cost of applying test t_j, measured
in terms of time, manpower requirements, or other
economic factors.

3. AND node: Two sets U = {u_1, ..., u_K} and V =
{v_1, ..., v_K}, where u_k and v_k denote u_k-out-of-v_k
logic for AND node a_k, i.e., AND node a_k has v_k
inputs and a failure must occur in at least u_k inputs
of this AND node for the faults to propagate to the
output.

The problem is to design a testing strategy that unambiguously
isolates the failure sources with minimum expected
testing cost. The AND/OR sequential test strategy
is represented in the form of a tree or a graph, where
the OR nodes represent the suspect sets of failure sources,
AND nodes are tests applied at various OR nodes, and
the leaves are the isolated failure sources.


3 Single Fault Testing Strategies

Once a system is described in terms of a digraph model,
all the necessary information for fault diagnosis can be
captured by a binary test matrix (fault dictionary), B =
[b_ij] of dimension m x n. In a single fault strategy, it is
assumed that the system is tested frequently enough that
at most one component has failed. Thus, the test matrix
denotes the full-order dependency among single failures
and the tests in the system, i.e., the rows and columns of
the test matrix correspond to failure sources and tests,
respectively. The test matrix can be computed by the
reachability analysis algorithms [12].

The single fault diagnosis problem, in its simplest
form, is the five-tuple (S, P, T, C, D), where

• S = S ∪ {s_0} = {s_0, s_1, ..., s_m} is a set of failure
sources, where s_0 is a dummy failure source denoting
the fault-free condition and ∪ denotes the union of
two sets;

• P = [p_0, p_1, ..., p_m] is the conditional probability vector
associated with the set of failure sources S, based
on a single fault assumption [11], where p_0 is the
probability of the fault-free condition, s_0. These are related
to the unconditional prior probabilities {p(s_i)} via:

\[
p_0 = \frac{1}{1 + \sum_{k=1}^{m} \frac{p(s_k)}{1 - p(s_k)}}, \qquad
p_i = \frac{\dfrac{p(s_i)}{1 - p(s_i)}}{1 + \sum_{k=1}^{m} \frac{p(s_k)}{1 - p(s_k)}} \quad \text{for } i = 1, \ldots, m \tag{1}
\]

• T and C are as defined in Section 2;

• D = [d_ij] is a binary test matrix of dimension (m +
1) x n, where d_0j = 0 for 1 ≤ j ≤ n, and d_ij = b_ij
for 1 ≤ i ≤ m and 1 ≤ j ≤ n.
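Equation (1) can be checked numerically. The following is a minimal Python sketch, not part of the original report: the function name single_fault_probabilities is hypothetical, and the priors are the unconditional fault probabilities listed in Table 1 of Example 1.a.

```python
def single_fault_probabilities(prior):
    """Return [p_0, p_1, ..., p_m] under the single-or-no-fault assumption.

    `prior` holds the unconditional probabilities p(s_1), ..., p(s_m).
    """
    # Odds p(s_k) / (1 - p(s_k)) of each failure source.
    odds = [p / (1.0 - p) for p in prior]
    denom = 1.0 + sum(odds)
    # p_0 is the fault-free probability; p_i is the normalized odds of s_i.
    return [1.0 / denom] + [o / denom for o in odds]


# Unconditional priors p(s_1)..p(s_5) taken from Table 1.
prior = [0.014, 0.027, 0.125, 0.068, 0.146]
P = single_fault_probabilities(prior)
print([round(p, 3) for p in P])
```

The computed values agree with the conditional probability vector quoted in Example 1.a to within about 0.002, which suggests the quoted vector was rounded slightly differently.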

The algorithms for designing optimal single-fault di- 
agnostic strategies are based on dynamic programming 
(DP) [13], and AND/OR graph search procedures. The 
DP technique is based on a bottom-up procedure, and 
has storage and computational requirements of O(3^n)
for even the simplest test sequencing problem. The
AND/OR² graph search algorithms are top-down heuristic
graph search procedures that employ a cost-to-go estimate
to speed up the solution search process [1].

² These AND/OR nodes of the search graph should not be confused
with the AND nodes of a digraph model. AND/OR graph
search formalizes the strategy generation process, whereas an AND
node of the digraph model denotes redundancy.

A novel feature of this approach is that the cost-to-go
estimate (termed the Heuristic Evaluation Function
(HEF)) is derived from Huffman coding and entropy. 
These information theoretic lower bounds ensure that 
an optimal solution is found using the AO*, HS, and 
CF search algorithms [12]. In addition, because of the 
top-down nature of the AND/OR graph search algo- 
rithms, several near-optimal search algorithms have been 
derived: (1) AO* algorithm, (2) limited search AO*, and 
(3) multi-step information heuristics. Furthermore, because
of their top-down nature, these algorithms extend
naturally to: (1) modular diagnosis, (2) precedence constraints,
setup operations, and resources, and (3) rectification.
The algorithms have been implemented in a
software package, termed TEAMS (Testability Engineering
and Maintenance System [12]). For convenience, these
algorithms are referred to as the TEAMS-S algorithms.
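To illustrate why Huffman coding and entropy yield lower bounds on the cost-to-go, the sketch below covers only the special case of unit-cost binary tests, where any test tree is a binary prefix code over the failure sources. It is an illustrative assumption-laden example, not the HEF used in TEAMS: the function names are hypothetical, and the probability vector is the one quoted in Example 1.a.

```python
import heapq
from math import log2


def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * log2(p) for p in probs if p > 0)


def huffman_expected_length(probs):
    """Expected codeword length of a binary Huffman code for `probs`.

    Each merge of the two smallest weights adds one bit to every leaf
    below it, so the expected length is the sum of all merged weights.
    """
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total


# Conditional probabilities [p_0, ..., p_5] quoted in Example 1.a.
P = [0.700, 0.010, 0.020, 0.100, 0.050, 0.120]
H = entropy(P)
W = huffman_expected_length(P)
print(f"entropy = {H:.3f} bits, Huffman length = {W:.3f} tests")
```

With unit test costs, no test strategy can have an expected number of tests below the Huffman length, and the Huffman length in turn cannot fall below the entropy. Both bounds are typically loose because the available tests restrict which partitions of the suspect set are achievable.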

Example 1.a: Consider the digraph model in Figure 1.
In this system, there are five failure sources s_1, ..., s_5.
The set of five tests, labeled t_1, ..., t_5, may be used to
identify the unknown failure sources. The test matrix,
along with the a priori probabilities of failure sources
and test costs, is shown in Table 1. Based on a single
or no fault assumption, the set of failure aspects is S =
{s_0, s_1, ..., s_5}, with the concomitant conditional probability
vector P = [0.700, 0.01, 0.020, 0.100, 0.050, 0.120].
An optimal test strategy for this example is shown in
Figure 2. For this test strategy, the average test cost is

\[
J = \sum_{i=0}^{m} \sum_{t_j \in TA_i} c_j \, p_i = 2.18,
\]

where TA_i is the set of
applied tests in the path leading to the isolation of failure
source s_i ∈ S.
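The quoted optimal cost can be reproduced on this small example by exhaustive search. The sketch below is a brute-force memoized recursion over suspect sets, written only as an illustration; it is not the AO* search used in TEAMS, and the names D, P, C and optimal_cost are assumptions of this example. The matrix rows are the failure signatures from Table 1, with an all-zero row added for the fault-free source s_0.

```python
from functools import lru_cache

# Rows: s_0 (fault-free), s_1..s_5; columns: t_1..t_5 (Table 1).
D = [
    (0, 0, 0, 0, 0),
    (0, 1, 0, 0, 1),
    (0, 0, 1, 1, 0),
    (1, 0, 0, 1, 1),
    (1, 1, 0, 0, 0),
    (1, 1, 1, 1, 0),
]
P = [0.700, 0.010, 0.020, 0.100, 0.050, 0.120]  # quoted in Example 1.a
C = [1, 1, 1, 1, 1]                             # unit test costs


@lru_cache(maxsize=None)
def optimal_cost(suspects):
    """Minimum expected cost to isolate one source in `suspects`.

    Costs are weighted by unnormalized probabilities, so the value at
    the root is the expected test cost J directly.
    """
    if len(suspects) <= 1:
        return 0.0
    mass = sum(P[i] for i in suspects)
    best = float("inf")
    for j, cost in enumerate(C):
        failed = frozenset(i for i in suspects if D[i][j])
        passed = suspects - failed
        if not failed or not passed:
            continue  # this test does not split the suspect set
        total = cost * mass + optimal_cost(passed) + optimal_cost(failed)
        best = min(best, total)
    return best


J = optimal_cost(frozenset(range(len(P))))
print(f"J = {J:.2f}")
```

The recursion visits at most every subset of the failure sources, so it is only practical for toy problems; this exponential growth is the reason the paper relies on heuristic AND/OR graph search.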


Figure 1: Digraph model for Example 1.a


Failure     Tests                      Fault probability
source      t_1  t_2  t_3  t_4  t_5    p(s_i)
------------------------------------------------------
s_1          0    1    0    0    1     0.014
s_2          0    0    1    1    0     0.027
s_3          1    0    0    1    1     0.125
s_4          1    1    0    0    0     0.068
s_5          1    1    1    1    0     0.146
------------------------------------------------------
Test cost
c_j          1    1    1    1    1

Table 1: Test Matrix, A Priori Fault Probabilities and
Test Costs for Example 1.a


Figure 2: Single-fault Test Strategy for the System of
Example 1.a


The single fault assumption may not be valid in situations
where the opportunity for frequent maintenance
does not exist. In such cases, the single fault strategies
can give a wrong diagnosis when multiple failures occur. In
[11], we showed that the set of hidden faults and masking
false failures are potential multiple fault candidates
at each leaf node of the single fault diagnostic tree. The
set of hidden faults for failure source s_i consists of those
failure sources whose failure signatures corresponding to
TA_i are subsets of the failure signature of s_i, while the
set of masking false failures for failure source s_i consists
of those sets of failure sources whose failure signatures
corresponding to tests TA_i add up to mask the failure
signature of s_i. Hidden faults can be diagnosed by applying
a single fault strategy repeatedly [11]. However,
if the set of masking false failures at the leaf nodes is not
empty, the single fault strategy will give a wrong diagnosis,
and repairing the implicated fault is obviously of no use
in this case. In the next section, we present an extended
single fault strategy to diagnose masking false failures,
as well as hidden faults in a system.
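The two definitions above can be made concrete on the test matrix of Example 1.a. The following Python sketch is a simplification made for illustration: it compares full failure signatures over all five tests rather than restricting them to the applied tests TA_i of a particular leaf, and the helper names are hypothetical.

```python
from itertools import combinations

# Failure signatures over t_1..t_5, from Table 1.
B = {
    "s1": (0, 1, 0, 0, 1),
    "s2": (0, 0, 1, 1, 0),
    "s3": (1, 0, 0, 1, 1),
    "s4": (1, 1, 0, 0, 0),
    "s5": (1, 1, 1, 1, 0),
}


def union_signature(faults):
    """Signature of a multiple fault: element-wise OR of its members."""
    return tuple(max(bits) for bits in zip(*(B[s] for s in faults)))


def hidden_faults(target):
    """Sources whose signature is covered by the signature of `target`."""
    sig = B[target]
    return [s for s in B
            if s != target and all(b <= t for b, t in zip(B[s], sig))]


def masking_false_failures(target):
    """Sets of other sources whose combined signature equals that of `target`."""
    others = [s for s in B if s != target]
    return [set(combo)
            for r in range(2, len(others) + 1)
            for combo in combinations(others, r)
            if union_signature(combo) == B[target]]


print(hidden_faults("s5"))           # ['s2', 's4']
print(masking_false_failures("s5"))  # one set containing s2 and s4
```

Under this simplification, the simultaneous failure of s_2 and s_4 produces the same signature as s_5 alone, so a single fault strategy would implicate s_5 even though it is good.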


4 Multiple Fault Diagnosis Using an Ex- 
tended Single Fault Testing Strategy 

In order to formalize this approach, let

• TS_j = test signature associated with test t_j. It indicates
all the failure sources detectable by test t_j,
i.e., TS_j = {s_i | b_ij = 1 for 1 ≤ i ≤ m};

• G = union of test signatures of previously passed
tests.

In this approach, we invoke a single fault strategy, and 
repair/replace the identified component at each leaf node, 
if any. Then, we check whether the repaired/replaced 
component at each leaf node is definitely faulty or not. 
If for any test t_j that failed previously, the cardinality of
TS_j − G is one, i.e., TS_j − G contains only one failure
source, then the corresponding failure source is definitely
faulty. If the repaired/replaced component is definitely
faulty, we apply additional tests, if necessary, to isolate
the remaining faults. Additional tests can be applied
from either the root OR node, or from the first failed
test in the path leading to the identification of previous 
faults. This process ensures that we do not come back to 
the same leaf node twice. 

Alternatively, if the replaced module is not definitely 
faulty, there exist other sets of components which have 
the same failure signature as the failure signature of the replaced
module, i.e., masking false failures [11]. In this
case, if we start from the root OR node or the first failed 
test in the path, we may reach the same leaf node. In 
order to solve this problem, we remove the replaced mod- 
ules from the ambiguity group at the current stage of di- 
agnosis, and invoke the single fault strategy TEAMS-S 
to isolate the remaining suspected components. Then, we 
repair/replace the identified modules at each leaf node. If 
the repaired/replaced module at a leaf node of this tree is 
definitely faulty, we apply additional tests from the root 
OR node or from the first failed test after last repair. On 
the other hand, if the identified module at a leaf node is 
not definitely faulty, we update the ambiguity group and 
invoke single fault strategy as before. This procedure is 
continued until no test gives further information or the 
system is fault-free. 

One drawback of the extended single-fault strategy is 
that the probability of repairing/replacing a good com- 
ponent, i.e., false alarm error or RTOK (retest OK), is 
higher than that with multiple fault strategies that em- 
ploy all informative tests before repairing a component 
in the system (see section 5.2). Furthermore, in the 
case of very large systems, it is practical to solve multiple
fault isolation problems up to a certain cardinality
L ≥ 1, e.g., single or double failures. This is based on
the premise that multiple faults of large cardinality are 
much less likely to occur. However, in an extended sin- 
gle fault strategy, if we stop expanding the diagnostic 
tree after limited repair actions, say L, it does not mean 
that we can diagnose multiple faults up to size L using 


the same tree. This is because a component may be re- 
paired/replaced before confirming that it is indeed faulty. 

Example 1.b: In this example, we consider the same
system as in Example 1.a. The extended single fault
diagnostic strategy for this example is shown in Figure
3, where the ACTION nodes represent the actions to
be performed at each stage of diagnosis. Note that the
shaded parts of the tree are the same as those in the single
fault diagnostic tree of Figure 2. The average testing cost
for this case is J = 2.780. The joint probability that s_5 is
good, and is repaired/replaced is 0.0103.



Figure 3: Extended Single Fault Strategy to diagnose
multiple faults in Example 1.a

5 Multiple Fault Testing Strategy in 
Systems without Redundancy (AND 
nodes) 

In digraph models without AND nodes, i.e., without re- 
dundancy, a test-matrix containing the full-order depen- 
dency among single failures and the tests can be used to 
diagnose multiple faults. This is because in these models 
the failure signature of a multiple-failure is assumed to 
be the union of failure signatures of individual failures 
(failure independence assumption). 

One approach that employs all informative tests before
repairing/replacing a component is to consider all possible
combinations of failure sources, i.e., 2^m, and generate
an optimal multiple fault diagnostic strategy using the
single-fault test sequencing algorithm TEAMS-S. However,
the storage and computational complexity of the optimal
multiple-fault isolation problem is super-exponential
in m. In order to reduce storage complexity, we use a
compact set notation [14], and in order to reduce the
computational complexity, we present a class of Sure diagnosis
strategies for multiple fault isolation.

5.1 Compact Set Notation 

Following Grunberg et al. [14], we use the compact notation
$A = \Theta(L; F_1, \ldots, F_L; G)$ to denote the multiple fault
ambiguity group at each OR node. The $F_i$ for $i = 1, \ldots, L$
and $G$ are subsets of $S = \{s_0, s_1, \ldots, s_m\}$; $G$ is the set of
known good failure sources (failure free sources), and $F_i$
for $i = 1, \ldots, L$ are sets that are known to contain at least
one definitely failed failure source each, i.e.,

$$\Theta(L; F_1, F_2, \ldots, F_L; G) = \{X \subseteq S \mid X \wedge F_i \neq \emptyset \text{ for } i = 1, \ldots, L, \text{ and } X \wedge G = \emptyset\}$$

where $\wedge$ denotes the intersection of two sets. In the
following, we summarize some of the properties of compact
set notation [11, 30]:

1. Multiple fault logic using the compact set notation
is as follows: the initial hypothesis set is the set of
all subsets of $S$, i.e., $A = \Theta(1; F_1 = S; G = \emptyset)$.
After performing a test, say $t_j$, the hypothesis set
$A = \Theta(L; F_1, \ldots, F_L; G)$ is decomposed as follows:

$$A \leftarrow \begin{cases} \Theta(L; (F_1 \wedge \overline{TS_j}), \ldots, (F_L \wedge \overline{TS_j}); (G \vee TS_j)) & \text{if } t_j \text{ passes} \\ \Theta(L + 1; F_1, \ldots, F_L, TS_j \wedge G^c; G) & \text{if } t_j \text{ fails} \end{cases}$$

where $\overline{TS_j}$ and $G^c$ are complements of the sets $TS_j$
and $G$, respectively.

2. If $E \supseteq F_i$ for some $i$ (that is, $E$ is a superset of $F_i$),
then $\Theta(L + 1; F_1, \ldots, F_L, E; G) = \Theta(L; F_1, \ldots, F_L; G)$
[14]. Thus, we should not apply any test whose
signature is a superset of one of the $F_i$'s, since the
test does not give any new information.

3. $A = \Theta(L; F_1, \ldots, F_L; G) = \Theta(L; F_1 \wedge G^c, \ldots, F_L \wedge G^c; G)$,
where superscript $c$ denotes the set complement,
i.e., $G^c = S - G$ [14].

4. Given a set of previously applied passed tests $T_p \subseteq T$
and failed tests $T_f \subseteq T$, the multiple fault ambiguity
group at the current stage of diagnosis can be
generated directly as follows: $\Theta(L; F_1, \ldots, F_L; G)$,
where $G = \bigvee_{t_j \in T_p} TS_j$, $L = |T_f| + 1$, $F_1 = S$ (see the
first property), and $F_{i+1} = TS_j \wedge G^c$ for $i = 1, \ldots, |T_f|$
and $t_j \in T_f$; then, employ property 2 to remove
supersets from the set $F = \{F_1, \ldots, F_L\}$.

5. If $|T_f| = 0$, then $L = 1$ and $s_0 \in F_1$. If $|T_f| > 0$,
none of the $F_i$'s contains $s_0$.

6. The worst case storage complexity of compact set
notation for an OR node is $O(mn)$ [11].

7. The failure sources belonging to an $F_i$ with cardinality
$|F_i| = 1$ are definitely faulty (one-for-sure condition).
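The update rule of the first property, together with the superset pruning of the second property, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `update` and `remove_supersets`, the use of integers to label failure sources, and the example test signatures are assumptions made here, not part of the original algorithm description.

```python
def remove_supersets(sets):
    """Property 2: drop every F_i that is a proper superset of another F_i."""
    unique = {frozenset(s) for s in sets}
    kept = [s for s in unique if not any(other < s for other in unique)]
    # Sort only to make the output order deterministic.
    return sorted(kept, key=lambda s: (len(s), sorted(s)))


def update(F, G, ts, passed):
    """Property 1: update (F, G) after test t_j with signature `ts`.

    F is a list of sets, G is the set of known-good sources.
    """
    ts, G = frozenset(ts), frozenset(G)
    if passed:
        # Everything covered by a passed test is good; remove it from each F_i.
        return remove_supersets(f - ts for f in F), G | ts
    # A failed test adds a new set TS_j ^ G^c that must contain a failure.
    return remove_supersets(list(F) + [ts - G]), G


# Hypothetical system: source 0 plays the role of s_0 ("no fault").
S = frozenset({0, 1, 2, 3})
F, G = [S], frozenset()
F, G = update(F, G, {1, 2}, passed=False)   # t_1 fails
F, G = update(F, G, {2, 3}, passed=True)    # t_2 passes
```

In this small run the final hypothesis set has a single $F_1$ of cardinality one, so the one-for-sure condition of the seventh property applies to source 1.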

5.2 Sure Strategies for Multiple Fault Diagnosis 

In this section, we present three diagnostic strategies, 
Sure 1-3, that seek to find definitely failed components, 
even though there may be others still undiagnosed. Thus, 
these strategies isolate failures one (or more) at a time, 
while not making an error when multiple faults are 
present. The framework for Sure strategies is sketched 
in Figure 4. 



Figure 4: Framework of Sure Strategies in a Test-and- 
repair Cycle 

The three basic ingredients of Sure 1-3 are: (i) min- 
imal candidate generation, (ii) minimal candidate isolation,
and (iii) multiple fault propagation. The minimality
property implies that a particular candidate includes the 
minimum number of failure sources that explains all test 
results observed so far (if any). Consequently, the inher- 
ent combinatorial explosion that occurs in generating an 
optimal multiple fault strategy is reduced substantially. 
Before describing the algorithms, we define minimal (ir- 
reducible) set and hitting set of a set of subsets: 

Definition 1: A minimal or irreducible set for a collection
of subsets $Q = \{Q_1, \ldots, Q_k\}$ is a set $I(Q) \subseteq Q$ such
that $I(Q) = Q - \{Q_i \mid \exists Q_j \in Q \text{ and } Q_j \subset Q_i\}$, i.e., $I(Q)$
is equal to the set $Q$ without any superset.

Definition 2: A hitting set for a collection of sets $Q =
\{Q_1, \ldots, Q_k\}$ is a set $H(Q) = \{H_1, \ldots, H_q\}$ such that
$H_j \subseteq \bigvee_{1 \le i \le k} Q_i$ for $j = 1, \ldots, q$, and $H_j \wedge Q_i \neq \emptyset$ for $i = 1, \ldots, k$.

Based on these definitions, it can be shown that [30]: 
Lemma 1: The minimal set of a multiple fault ambiguity
group $A = \Theta(L; F_1, \ldots, F_L; G)$ is the minimal
hitting set for the collection of sets $F = \{F_1, \ldots, F_L\}$, i.e.,
$I(A) = I(H(F))$.
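To make Lemma 1 concrete, the following Python sketch enumerates the minimal hitting sets of a small collection by brute force. It is not the HS-tree algorithm of Reiter [15] with the correction of Greiner et al. [16]; it simply tries candidates in order of increasing size, which is adequate only for small examples. The function name `minimal_hitting_sets` and the integer labels for failure sources are assumptions made for illustration.

```python
from itertools import combinations


def minimal_hitting_sets(F):
    """Return all minimal sets that intersect every member of F (brute force)."""
    universe = sorted(set().union(*F))
    found = []
    for size in range(1, len(universe) + 1):
        for cand in combinations(universe, size):
            c = set(cand)
            hits_all = all(c & set(f) for f in F)
            # Candidates are visited by increasing size, so any previously
            # found hitting set contained in c proves that c is not minimal.
            if hits_all and not any(h <= c for h in found):
                found.append(c)
    return found


# F_1 = {s1, s4, s5}, F_2 = {s2, s3, s5}, written with integer labels.
candidates = minimal_hitting_sets([{1, 4, 5}, {2, 3, 5}])
```

For this input the minimal candidates are the single fault {5} and the four double faults {1, 2}, {1, 3}, {2, 4} and {3, 4}.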

In Sure 1-3 strategies, at each stage of diagnosis, we
consider the minimal candidate set of the multiple fault
suspect set corresponding to the OR node at that stage.
Reiter [15] has derived an algorithm to determine the
minimal hitting set of a collection of sets, and Greiner et
al. [16] have presented a correction to Reiter's algorithm.
We use this technique to determine the minimal
hitting set of $F = \{F_1, \ldots, F_L\}$ at an OR node. After
determining the minimal candidates of a multiple fault
suspect set at the current stage, we evaluate the conditional
probabilities of minimal candidates using Bayes'
rule. Then, we invoke the single fault strategy TEAMS-S
to isolate these candidates, and propagate the multiple
fault suspect set through the resulting diagnostic tree.
Note that, using the fourth property of compact set notation,
it is sufficient to generate and store the multiple fault
ambiguity group at the leaf nodes of this tree only. We
repeat these procedures for each leaf node of the tree,
until: (1) the intersection of minimal candidates is not
empty, i.e., the corresponding failure sources are definitely
faulty, or (2) no test provides further information.
The former corresponds to the case when the cardinality
of one or more $F_i$ in the ambiguity group is one.

After repairing/replacing the components isolated by 
Sure strategies, we apply additional tests, if necessary, 
to isolate the remaining failure sources. We explore 
three different approaches for the application of addi- 
tional tests: (1) start from the root OR node of the diag- 
nostic tree; (2) start from the first failed test in the path 
leading to the isolation of previous faults; (3) update the 
multiple fault suspect set at the leaf node by integrat- 
ing previous test results using the fourth property of the 
compact set notation, removing repaired/replaced fail- 
ure sources from the ambiguity group at the leaf node, 
and invoking Sure strategies for the updated ambiguity 
group. Sure 1-3 algorithms correspond to the first, sec- 
ond and third approaches for applying additional tests, 
respectively. These are presented in detail in [17]. 

The Sure1 diagnostic strategy is simple and the resulting
diagnostic tree is very similar to the single fault diagnostic
tree. However, the expected testing cost using this
strategy is usually high. The expected testing cost using
the Sure2 diagnostic strategy is less than that of Sure1, but
the next test to be performed after repairing/replacing
each failure source will be different. Furthermore, the
diagnostic tree will change to a digraph (directed graph).
The expected testing cost for the third approach is the
smallest, but the size of the diagnostic tree will be
considerably larger than the others. This is because the
number of leaves of the diagnostic tree is the same as
the number of distinguishable multiple-fault failure
signatures. For example, in the worst case, i.e., when the
test matrix $B$ is diagonal, the number of leaves is $2^m$.
This is because there are $2^m$ possible multiple-fault
failure signatures. But, the number of leaf nodes in the Sure1
and Sure2 diagnostic strategies in this case is the same
as in a single-fault strategy, i.e., $m + 1$.

One of the interesting features of Sure strategies is that 
the starting point for all three algorithms is the same 
tree as in a single fault strategy for the system under 
consideration. This is because the minimal candidate
set for $2^S$ is $\{s_0, s_1, \ldots, s_m\}$. Therefore, these strategies
isolate a single fault with the smallest average cost, while 
not making an error when multiple faults are present. 
Furthermore, in the case of very large systems, instead 
of generating all minimal candidates, we can generate 
minimal candidates of size less than a certain threshold, 
L, and diagnose multiple faults up to that size. 

Example 1.c: Figure 5, without (with) the dashed
lines, shows the multiple fault strategy for the system
in Example 1.a, based on the Sure1 (Sure2) algorithm, where
$A_i$ denotes the ambiguity group corresponding to the
OR node $i$, and:

$A_1 = \Theta(1; \{s_0, s_1, s_2, s_3, s_4, s_5\}; \emptyset)$;
$A_2 = \Theta(1; \{s_0, s_2, s_3\}; \{s_1, s_4, s_5\})$;
$A_3 = \Theta(1; \{s_1, s_4, s_5\}; \emptyset)$;
$A_4 = \Theta(1; \{s_0\}; \{s_1, s_2, s_3, s_4, s_5\})$;
$A_5 = \Theta(1; \{s_2, s_3\}; \{s_1, s_4, s_5\})$;
$A_6 = \Theta(1; \{s_1, s_4\}; \{s_2, s_3, s_5\})$;
$A_7 = \Theta(2; \{s_1, s_4, s_5\}, \{s_2, s_3, s_5\}; \emptyset)$;
$A_8 = \Theta(1; \{s_2\}; \{s_1, s_3, s_4, s_5\})$;
$A_9 = \Theta(1; \{s_3\}; \{s_1, s_4, s_5\})$;
$A_{10} = \Theta(1; \{s_1\}; \{s_2, s_3, s_4, s_5\})$;
$A_{11} = \Theta(1; \{s_4\}; \{s_2, s_3, s_5\})$;
$A_{12} = \Theta(2; \{s_4, s_5\}, \{s_2, s_5\}; \{s_1, s_3\})$;
$A_{13} = \Theta(3; \{s_1, s_4, s_5\}, \{s_2, s_3, s_5\}, \{s_1, s_3\}; \emptyset)$;
$A_{14} = \Theta(2; \{s_3\}, \{s_1, s_4\}; \{s_2, s_5\})$;
$A_{15} = \Theta(3; \{s_1, s_4, s_5\}, \{s_1, s_3\}, \{s_2, s_5\}; \emptyset)$;
$A_{16} = \Theta(2; \{s_1\}, \{s_2\}; \{s_3, s_4, s_5\})$;
$A_{17} = \Theta(4; \{s_1, s_3\}, \{s_2, s_5\}, \{s_3, s_4, s_5\}, \{s_1, s_4, s_5\}; \emptyset)$.

Note that the shaded parts of the tree are the same
as those in the single fault diagnostic tree of Figure 2.
The average testing cost for the optimal multiple fault
strategy is $J = 2.411$, and the average testing costs for
the first (Sure1) and second (Sure2) approaches using
the diagnostic strategy of Figure 5 are $J = 2.715$ and
$J = 2.616$, respectively.

Example 1.d: The Sure3 strategy for Example 1.a is
shown in Figure 6, where
$A_{18} = A_{20} = A_{24} = \Theta(1; \{s_0\}; \{s_1, s_2, s_3, s_4, s_5\})$;
$A_{19} = \Theta(1; \{s_2\}; \{s_1, s_3, s_4, s_5\})$;
$A_{23} = \Theta(1; \{s_4\}; \{s_2, s_3, s_5\})$;
$A_{21} = A_{22} = A_{25} = \Theta(1; \{s_1\}; \{s_2, s_3, s_4, s_5\})$.

Note that the shaded and dashed parts of the tree in
Figure 6 are the same as those in Figure 5. For this test
strategy, the average test cost is $J = 2.535$. In this example,
we considered a block replacement strategy when no
test gives further information, for example, see ambiguity
groups $A_{12}$ and $A_{17}$.
Figure 5: Sure1 and Sure2 Test Strategies for Example
1.a

6 Multiple Fault Testing Strategy in Sys- 
tems with Redundancy (AND nodes) 

In digraph models with AND nodes, the assumption that 
the failure signature of a multiple failure is the union 
of failure signatures of the corresponding individual fail- 
ures is not valid. This is because the failures of multi- 
ple modules can propagate to the output of AND nodes, 
and therefore, generate a different failure signature. In 
these models, minimal faults and their failure signatures 
contain all the necessary information for multiple fault 
diagnosis [30]. 

In this section, we first present a top-down recursive 
algorithm to find all the minimal faults and their failure 
signatures; the minimal fault algorithm is presented in 
detail in [18]. Even though this algorithm can easily be 
extended to generate minimal faults with a limited size L, 
we present an efficient bottom-up procedure to generate 
minimal faults up to a limited size L. This is because, in 
very large systems, it is efficient and practical to generate 
minimal faults up to a specified size L using the bottom- 
up procedure. 

Then, after generating minimal faults, we augment the 
binary test matrix B to include minimal faults, and ex- 
tend Sure diagnostic strategies of the previous section to 
systems with redundancy. 

6.1 Minimal Fault Algorithm 

In order to generate the minimal faults and their failure
signatures, we use the reachability analysis algorithm
of [12] to build: (1) failure source-test dependency matrix
$B$ of dimension $m \times n$, which denotes the full-order
dependency among single failure sources and tests, (2)
failure source-AND node dependency matrix $H$ of dimension
$m \times z$, which denotes the full-order dependency
among failure sources and AND node inputs and outputs,
where $z = \sum_{k=1}^{K} (v_k + 1)$, (3) AND node-test dependency
matrix $E$ of dimension $K \times n$, which denotes
the full-order dependency among AND nodes and tests,
(4) AND node-AND node dependency matrix $R$ of dimension
$K \times z$, which denotes the full-order dependency
among AND node outputs and AND node inputs and
outputs, and (5) AND node-AND node reachability matrix
$Q$ of dimension $K \times K$, which denotes the full-order
dependency between AND nodes and AND nodes by setting
AND nodes' logic of $n_k$-out-of-$v_k$ to 1-out-of-$v_k$ in
the reachability analysis algorithm, i.e., AND nodes
devolve into OR nodes.

Figure 6: Sure3 Test Strategy for Example 1.a

For notational convenience, given a binary matrix $X =
[x_{ij}]$ of dimension $k_1 \times k_2$, we define $Xr_i$ for $i = 1, \ldots, k_1$
as its $i$th row, and $Xc_j$ for $j = 1, \ldots, k_2$ as its $j$th column.

6.1.1 Top-down approach 

Using the binary matrices, the top-down minimal fault 
algorithm finds the minimal faults of the digraph model 
via the following steps (see [18] for details): 

Step 1:

Because of the definition of minimal faults, the algorithm
needs to process only those AND nodes $A_s \subseteq A$ for which
there exists at least one path from the AND node to a test.
The algorithm sorts the AND nodes in $A_s$ such that each
AND node will be processed before any other AND node
reachable from it. This step prevents the algorithm from
performing the same operations twice.

The procedure for finding and sorting $A_s$ is as follows:
(1) the algorithm finds a subset of AND nodes
$A_e \subseteq A$ such that each AND node $a_k \in A_e$ can be
detected by at least one test, i.e., $A_e = \bigvee Ec_j$ for $j = 1, \ldots, n$;
(2) using the AND node-AND node reachability matrix $Q$,
it finds the subset of AND nodes $A_s$ that reach $A_e$, i.e.,
$A_s = \bigvee_{a_k \in A_e} Qc_k$; and (3) the algorithm sorts the AND
nodes in $A_s$ based on the number of AND nodes reaching
them in ascending order. Note that the number of AND
nodes reaching an AND node $a_j$ is $\sum_{k=1}^{K} q_{kj}$.
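Step 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the matrices are plain lists of 0/1 rows, `Q[i][k] == 1` is taken to mean that AND node $a_i$ reaches AND node $a_k$, and the diagonal of `Q` is assumed to be 1 (each node reaches itself). The function name `select_and_sort_and_nodes` and the example matrices are hypothetical.

```python
def select_and_sort_and_nodes(E, Q):
    """Return the AND nodes to process (A_s), sorted for Step 1.

    E: K x n AND node-test dependency matrix.
    Q: K x K AND node-AND node reachability matrix (assumed reflexive).
    """
    K = len(E)
    # A_e: AND nodes detected by at least one test (OR of the columns of E).
    A_e = [k for k in range(K) if any(E[k])]
    # A_s: AND nodes that reach some node of A_e (OR of the columns Qc_k).
    A_s = [i for i in range(K) if any(Q[i][k] for k in A_e)]

    # The number of AND nodes reaching a_j is the j-th column sum of Q.
    def reach_count(j):
        return sum(Q[i][j] for i in range(K))

    return sorted(A_s, key=reach_count)


# Hypothetical model: a_0 feeds a_1, a_1 is observed by the only test,
# and a_2 is isolated and unobserved.
E = [[0], [1], [0]]
Q = [[1, 1, 0],
     [0, 1, 0],
     [0, 0, 1]]
order = select_and_sort_and_nodes(E, Q)
```

With these inputs the unobserved node is dropped and the upstream node is ordered before the node it feeds.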

Step 2: 

For each AND node $a_k \in A_s$, the algorithm finds the
set of failure sources and AND nodes that can reach the
AND node inputs and output. Then, it removes single
failures affecting the AND node from its input signatures,
and generates the minimal combinations of AND nodes
and failure nodes for each AND node $a_k \in A_s$ using one
of the following three approaches: (1) minimal hitting set
method using a breadth-first search [15, 16], (2) minimal
hitting set method using a depth-first search, and (3) binary
decision diagrams [2]. However, because of the small
number of AND node inputs, i.e., $v_k$ for $k = 1, \ldots, K$, usually
2 or 3, there is no significant difference in using any
of these three approaches.

Note that we can consider a limit L for the number 
of failure sources in the minimal combinations of AND 
nodes. In the first and second approaches, those com- 
binations with more than L failure sources are not ex- 
panded. In the third approach, at first the decision dia- 
gram is generated, and then, the combinations with more 
than L faults are eliminated. Furthermore, when $v_k = 2$
for k = 1, ..., K, the problem of finding minimal combi- 
nations for each AND node reduces to the problem of 
finding the cross-products of failure signatures of AND 
node inputs [6]. 
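For the two-input case, the cross-product construction can be illustrated with the Python sketch below. It assumes each AND node input is described by a list of sets (the failure combinations that reach that input, with single failures affecting both inputs already removed, as described above). The function name `cross_product_combinations` and the integer labels are hypothetical.

```python
def cross_product_combinations(input1, input2):
    """Minimal combinations for a 2-input AND node.

    Each argument is a list of sets; a combination is the union of one
    set from each input. Non-minimal (superset) unions are discarded.
    """
    combos = {frozenset(a) | frozenset(b) for a in input1 for b in input2}
    minimal = [c for c in combos if not any(other < c for other in combos)]
    # Sort only to make the output order deterministic.
    return sorted(minimal, key=lambda c: (len(c), sorted(c)))


# Sources 1 and 2 reach the first input, source 3 reaches the second.
pairs = cross_product_combinations([{1}, {2}], [{3}])
```

Here `pairs` contains the two double faults {1, 3} and {2, 3}; if one input already lists a superset such as {1, 2} alongside {1}, the resulting union {1, 2, 3} is discarded in favor of {1, 3}.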

Step 3: 

After generating the minimal combinations for each AND
node in $A_s$, the algorithm processes one AND node at a
time. The subroutine for this part is a recursive function
and, for simplicity, we call it MFG (Minimal Fault Generator).
In order to find minimal faults for an AND node,
say $a_i$, we call MFG for $a_i$. MFG replaces $a_i$ with one of
its minimal combinations. If this combination contains
no AND nodes, MFG adds this combination to the set of
minimal faults of that AND node only if it is not a superset
of one of them. On the other hand, if the combination
contains AND nodes, it selects one of the AND nodes
from this combination, say $a_j$, and calls MFG again for
$a_j$. This procedure continues until no AND node remains
in that combination, or a previously processed AND node
is selected, i.e., there exists a feedback loop containing
the AND node. In the former case, MFG adds this combination
to the set of minimal faults of the AND node,
only if it is not a superset of one of them. In the latter
case, if the failure of the combination can propagate to
the output of the AND node, MFG ignores that AND
node, and continues. Otherwise, it returns without doing
anything. This step prevents the algorithm from
entering an infinite loop when a cycle is encountered.

Note that we can consider a limit L for the number of 
faults in the minimal faults of AND nodes. In this case, 
at each iteration, MFG checks whether the number of 
faults in the AND nodes and failure nodes combination 
is greater than the limit L or not. If the number of failure 
nodes in this combination is greater than L, it returns; 
otherwise, it expands the selected AND node as before. 

In order to make the algorithm efficient, we employ the 
following Lemma: 

Lemma 2: Let us assume that $Xr$ is a vector of dimension
$z$, and if $a_k \in A_s$, $Xr[k(l)] = 1$ for $l = 0, \ldots, v_k$; otherwise
$Xr[k(l)] = 0$ for $l = 0, \ldots, v_k$. If $Hr_i \wedge Xr$ is equal to
$Hr_j \wedge Xr$ and there exists a minimal combination $mc \in
MC(a_k)$ and $s_i \in mc$, then $(mc - \{s_i\}) \vee \{s_j\} \in MC(a_k)$.
Further, if there exists a minimal fault $mf \in MF(a_k)$
and $s_i \in mf$, then $(mf - \{s_i\}) \vee \{s_j\} \in MF(a_k)$.

Using this Lemma, before generating minimal combi- 
nations of each AND node, we find all the failure sources 
with the same failure signature in the H matrix. That 
is, we generate the set $M = \{M_1, M_2, \ldots, M_a\}$ such that
$M_l \subseteq S$ for $l = 1, \ldots, a$ and all $s_i \in M_l$ have the same
failure signatures in the binary matrix $H$. Using this approach,
the failure sources that have the same effect on
the AND nodes, i.e., the failure sources in series [6], or 
those in Gross feedback loops [12], are considered as a 
group of failures. Thus, instead of generating the mini- 
mal combinations and minimal faults for each AND node 
based on S, we generate them based on M only, i.e., min- 
imal faults are subsets of M. After generating these sets, 
we expand the minimal faults of AND nodes based on M 
to generate the minimal faults based on S. 

Step 4: 

After generating the minimal faults of the AND nodes,
the algorithm generates the minimal faults of the digraph
model ($MF_d$). Firstly, using the AND node-test
dependency matrix $E$, the algorithm removes the minimal
faults of those AND nodes that cannot be detected
by any test. Secondly, if a set of faults belongs to the set
of minimal faults of two or more AND nodes, the algorithm
considers only one of them. Then, using the binary
matrices, the algorithm generates the failure signatures
of the remaining faults. Note that the remaining faults may
contain supersets, and because of the test points in the
digraph model, a superset may or may not be a minimal
fault of the digraph model. Thus, those supersets which
have the same failure signatures as the union of the failure
signatures of their subsets are removed.

6.1.2 Bottom-up Approach 

The bottom-up approach can be used to generate minimal
faults up to a limited size, say 2 or 3, in systems with
as many as 10,000 failure sources and 1000 AND nodes.
For clarity, let us assume that $v_k = 2$ for $k = 1, \ldots, K$. In
this algorithm, using the first step of the top-down procedure,
we find a subset of AND nodes $A_s \subseteq A$ that should
be processed. Then, using the failure source-AND node
dependency matrix $H$, for each AND node $a_k \in A_s$, the
algorithm finds the failure sources that can reach one of
the AND node inputs, but cannot reach their outputs,
i.e., $Sc_k(1)$, $Sc_k(2)$ for $a_k \in A_s$. By finding the cross-products
[6] between the two sets $Sc_k(1)$ and $Sc_k(2)$, the algorithm
generates minimal combinations of size 2. Then,
using the failure source-AND node dependency matrix
$H$ and AND node-AND node binary matrix $R$, the failure
signatures of these faults can be found and stored
in a binary matrix $B'$. Using binary matrix $B'$, the algorithm
finds the failure sources that can reach one of
the AND node inputs, but cannot reach their outputs,
i.e., $Sc'_k(1)$, $Sc'_k(2)$ for $a_k \in A_s$. By finding the cross-products
between the two sets $Sc'_k(1)$ and $Sc_k(2)$, $Sc_k(1)$ and
$Sc'_k(2)$, and $Sc'_k(1)$ and $Sc'_k(2)$, the algorithm generates
all minimal combinations of size 3, as well as some minimal
combinations of size 4. This procedure is continued
until either no failure can reach any of the AND node
inputs, or all the minimal faults of the desired size are
generated. After generating minimal combinations of size
$L$, using the fourth step of the top-down algorithm, minimal
faults of size $L$ of the digraph model are generated.
Note that, because of the presence of feedback loops and
common elements in some paths in the digraph models,
it is not efficient to use a bottom-up approach to find all
the minimal faults of a digraph.

6.2 Extended Compact Set Notation

After generating minimal faults and their failure signatures,
we expand the binary matrix $B$ with the minimal
fault failure signatures. Thus, in systems with AND
nodes, each row of the test matrix corresponds to a subset
of $S = \{s_1, \ldots, s_m\}$. For notational simplicity, let us
assume that the new test matrix contains $m_n = m + m_f$
rows, where $m_f$ is the number of minimal faults. We define
$W = \{w_1, \ldots, w_{m_n}\}$, where $w_i = \{s_i\}$ for $i = 1, \ldots, m$,
and $w_i \subseteq S$ for $m + 1 \le i \le m_n$.


After generating the binary test matrix, we extend the
compact set notation of the previous section to systems
with redundancy. In this case, the ambiguity group at
each OR node of the AND/OR graph is based on $W$, i.e.,
the $F_i$ for $i = 1, \ldots, L$ and $G$ are subsets of $\bar{W} = W \vee \{w_0\}
= \{w_0, w_1, \ldots, w_{m_n}\}$, where $w_0 = \{s_0\}$ and

$$\Theta(L; F_1, F_2, \ldots, F_L; G) = \{X \subseteq W \mid X \wedge F_i \neq \emptyset \text{ for } i = 1, \ldots, L, \text{ and } T(X) \wedge G = \emptyset\}$$

where $T(X) = \bigvee_{w_i \subseteq S(X)} w_i$, and $S(X) \subseteq S$ is the set of
all failure sources in $w_j \in X$, i.e., $S(X) = \{s_i \mid \exists w_j \in
X \text{ and } s_i \in w_j\}$.

6.3 Extended Sure Strategies 

In order to derive the Sure diagnostic strategy, we need to
generate the minimal candidates at each iteration. Note
that Lemma 1 is not valid for minimal faults in a system
with redundancy. This is because $w_j$ for $j = 1, \ldots, m_n$ are
not independent, and because of the AND nodes, the failure
signature of a set of components that has something
in common with the $F_i$'s is consistent with the failed
tests, but it may be inconsistent with the passed tests.
In this case, the set of minimal candidates of a multiple
fault ambiguity group is generated using the following
Lemma [30].

Lemma 3: The minimal set of a multiple fault ambiguity
group $A = \Theta(L; F_1, \ldots, F_L; G)$ for a system with
redundancy is $I(A) = \{X \mid X \in I(H(F)) \text{ and } T(X) \wedge G = \emptyset\}$,
where $F = \{F_1, \ldots, F_L\}$. That is, the minimal set of a
multiple fault ambiguity group contains only those elements
of the minimal hitting set of $F$ that are consistent
with the set of good components, $G$.

In addition, the one-for-sure condition of previous sec- 
tion should be generalized as follows: 

Lemma 4: If the cardinality of any $F_i$ is one, all the
failure sources in $w_j \in F_i$ are faulty, and if the cardinality
of $F_i$ is greater than one, all the failure sources in
$\bigwedge_{w_j \in F_i} w_j$ are definitely faulty. Evidently, these two
conditions can be combined as follows: all the failure sources
in $\bigwedge_{w_j \in F_i} w_j$, for $i = 1, \ldots, L$, are definitely faulty.
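The combined condition of Lemma 4 amounts to intersecting the members of each $F_i$. A minimal Python sketch follows, assuming each $F_i$ is given as a list of sets of failure-source labels (one set per $w_j$); the function name `definitely_faulty` and the integer labels are illustrative assumptions, and the special element $w_0$ is not treated separately here.

```python
def definitely_faulty(F):
    """Failure sources common to every w_j of some F_i (Lemma 4)."""
    faulty = set()
    for F_i in F:
        if not F_i:
            continue  # an empty F_i carries no information in this sketch
        common = set.intersection(*(set(w) for w in F_i))
        faulty |= common
    return faulty


# F_1 = {w_3} with w_3 = {s3}; F_2 = {w_1, w_4} with w_1 = {s1}, w_4 = {s4}.
sure = definitely_faulty([[{3}], [{1}, {4}]])
```

In this case only source 3 is certain, because $w_1$ and $w_4$ share no failure source. If an $F_i$ instead contained $\{s_1, s_3\}$ and $\{s_3\}$, source 3 would again be reported, even though the cardinality of that $F_i$ is two.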

Further, we can use the following two Lemmas to up- 
date the ambiguity groups at each OR node. 

Lemma 5: Let us assume that we repaired definitely
failed components $X \subseteq S$, and that there exists a $w_j \in G$
such that $|w_j - X| = 1$ and $\{s_k\} = w_j - X$. Then, $s_k$ is
good and should be added to the good component subset
$G$.

Lemma 6: If we repair definitely failed components
$w_i = \{s_i\}$, and there exists a $w_j$ such that $s_i \in w_j$, then
$w_j$ should be added to the good component subset $G$.

In summary, the Sure diagnostic strategies for systems
with redundancy are as follows:

• the ambiguity group at each OR node of the
AND/OR graph is represented based on $W$ (rather
than $S$),

• minimal candidates are generated based on Lemma
3,

• definitely failed components at the leaf nodes are
found using Lemma 4, and

• the ambiguity groups at the leaf nodes are updated
based on Lemmas 5 and 6.

Example 1.e: Consider the digraph model in Figure 7.
This digraph model differs from the one in Figure 1 in
that we have added an AND node $a_1$. The minimal fault
for this digraph model is $w_6 = \{s_1, s_3\}$ (see Table 2). Figure
8, without (with) the dashed lines, shows the multiple
fault strategy for this system based on the Sure1 (Sure2)
diagnostic strategy, where $A_i$ denotes the ambiguity group
corresponding to the OR node $i$, and:

$A_1 = \Theta(1; \{w_0, w_1, w_2, w_3, w_4, w_5, w_6\}; \emptyset)$;
$A_2 = \Theta(1; \{w_0, w_2, w_3\}; \{w_1, w_4, w_5, w_6\})$;
$A_3 = \Theta(1; \{w_1, w_4, w_5, w_6\}; \emptyset)$;
$A_4 = \Theta(1; \{w_0\}; \{w_1, w_2, w_3, w_4, w_5, w_6\})$;
$A_5 = \Theta(1; \{w_2, w_3\}; \{w_1, w_4, w_5, w_6\})$;
$A_6 = \Theta(1; \{w_1, w_4\}; \{w_2, w_3, w_5, w_6\})$;
$A_7 = \Theta(2; \{w_1, w_4, w_5, w_6\}, \{w_2, w_3, w_5, w_6\}; \emptyset)$;
$A_8 = \Theta(1; \{w_2\}; \{w_1, w_3, w_4, w_5, w_6\})$;
$A_9 = \Theta(1; \{w_3\}; \{w_1, w_4, w_5, w_6\})$;
$A_{10} = \Theta(1; \{w_4\}; \{w_1, w_2, w_3, w_5, w_6\})$;
$A_{11} = \Theta(1; \{w_1\}; \{w_2, w_3, w_5, w_6\})$;
$A_{12} = \Theta(2; \{w_2, w_5\}, \{w_4, w_5\}; \{w_1, w_3, w_6\})$;
$A_{13} = \Theta(3; \{w_1, w_3, w_6\}, \{w_2, w_3, w_5, w_6\}, \{w_1, w_4, w_5, w_6\}; \emptyset)$;
$A_{14} = \Theta(2; \{w_3\}, \{w_1, w_4\}; \{w_2, w_5, w_6\})$;
$A_{15} = \Theta(3; \{w_1, w_3, w_6\}, \{w_2, w_5, w_6\}, \{w_1, w_4, w_5, w_6\}; \emptyset)$;
$A_{16} = \Theta(2; \{w_1\}, \{w_2\}; \{w_3, w_4, w_5, w_6\})$;
$A_{17} = \Theta(4; \{w_1, w_3, w_6\}, \{w_2, w_5, w_6\}, \{w_1, w_4, w_5, w_6\}, \{w_3, w_4, w_5, w_6\}; \emptyset)$.
Figure 7: Digraph model


Note that $A_{14} = \Theta(2; \{w_3\}, \{w_1, w_4\}; \{w_2, w_5, w_6\})$,
i.e., $F_1 = \{w_3\}$, $F_2 = \{w_1, w_4\}$, and $G = \{w_2, w_5, w_6\}$.
Therefore, $w_3 = \{s_3\}$ is definitely faulty. After repairing
$s_3$, using Lemma 5 (i.e., $w_6 \in G$), $w_1 = \{s_1\}$ is good and
should be added to the list of good components $G$. Thus,
by eliminating $w_1$ from $F_2$, we conclude that $w_4$ is definitely
faulty; the cardinality of $F_2$ is one. After repairing
$w_4$, $G = S$, and there is no need to apply additional tests.
The average testing costs for the Sure1 and Sure2 diagnostic
strategies using the diagnostic strategy of Figure 8 are
$J = 2.603$ and $J = 2.505$, respectively.


FAILURE SOURCES        TESTS
                       t1   t2   t3   t4   t5
TEST COSTS c_j          1    1    1    1    1
w1 = {s1}               0    1    0    0    1
w2 = {s2}               0    0    1    1    0
w3 = {s3}               1    0    0    1    1
w4 = {s4}               1    1    0    0    0
w5 = {s5}               1    1    1    1    0
w6 = {s1, s3}           1    1    1    1    1

Table 2: Test Matrix and Test Costs for Example 1.e



Figure 8: Multiple Fault Diagnostic Strategy for Example
1.e

Example 1.f: The Sure3 strategy for Example 1.e is
shown in Figure 9, where
$A_{18} = A_{20} = \Theta(1; \{w_0\}; \{w_1, w_2, w_3, w_4, w_5, w_6\})$;
$A_{19} = \Theta(1; \{w_2\}; \{w_1, w_3, w_4, w_5, w_6\})$;
$A_{21} = \Theta(1; \{w_4\}; \{w_1, w_2, w_3, w_5, w_6\})$.

Note that the dashed parts of the tree in Figure 9 are
the same as those in Figure 8. For this test strategy, the
average test cost is $J = 2.492$. In this example, we considered
a block replacement strategy when no test gives
further information, for example, see ambiguity groups
$A_{12}$ and $A_{17}$.

Figure 9: Sure3 Test Strategy for Example 1.e

7 On-line Multiple Fault Diagnosis
Strategies

In this section, we consider the problem of designing an 
on-line (interactive or dynamic) diagnostic strategy to 
isolate multiple failures in a physical system. That is, in- 
stead of generating the entire diagnostic tree, the on-line 
strategy only suggests the next test to be applied given 
the outcomes of previously applied tests. Our approach 
is to employ concepts from information theory and La- 
grangian relaxation to solve this problem. 

At each stage of diagnosis, we consider a set of available
tests $T_A$ which can provide some information about
the system. Initially, $T_A$ contains all tests except those
that can detect all or no faults. Then, we recommend
a test using a local, step-by-step optimization algorithm
developed by Johnson [19]. In this approach, a test $t_k$
from the set of available tests $T_A$ is selected if it maximizes
the information gain per unit cost of the test:

$$k = \arg\max_j \left\{ \frac{IG(A, t_j)}{c_j} \right\} \qquad (2)$$

where $A$ is the ambiguity group at the current stage of
diagnosis, and $IG(A, t_j)$ is the information gain given by:

$$IG(A, t_j) = -\{P'(A_{jp}) \log_2 P'(A_{jp}) + P'(A_{jf}) \log_2 P'(A_{jf})\} \qquad (3)$$

In (3), $\{A_{jp}, A_{jf}\}$ are the subsets of the ambiguity group
$A$ corresponding to pass and fail outcomes of test $t_j$
such that $A_{jp} \vee A_{jf} = A$, and $P'(A_{jp}) = P(A_{jp})/P(A)$,
$P'(A_{jf}) = P(A_{jf})/P(A)$ are the conditional probabilities
of the pass and fail outcomes of test $t_j$, and $P(A_{jp})$
and $P(A_{jf})$ are the probabilities of ambiguity groups $A_{jp}$
and $A_{jf}$, respectively.
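The selection rule in (2) and (3) can be sketched in Python as follows. The sketch assumes the probabilities $P(A_{jp})$ and $P(A_{jf})$ have already been computed for each available test; the function names `information_gain` and `select_test`, the dictionary layout, and the example numbers are hypothetical.

```python
from math import log2


def information_gain(p_pass, p_fail):
    """Equation (3): entropy of the pass/fail outcome, given P(A_jp) and P(A_jf)."""
    total = p_pass + p_fail  # P(A), since A_jp and A_jf partition A
    ig = 0.0
    for p in (p_pass, p_fail):
        q = p / total        # conditional probability P'
        if q > 0.0:          # treat 0 * log2(0) as 0
            ig -= q * log2(q)
    return ig


def select_test(outcome_probs, costs):
    """Equation (2): pick the test with maximum information gain per unit cost.

    outcome_probs maps a test name to (P(A_jp), P(A_jf)); costs maps it to c_j.
    """
    return max(outcome_probs,
               key=lambda t: information_gain(*outcome_probs[t]) / costs[t])


# Hypothetical outcome probabilities for two available tests.
probs = {"t1": (0.5, 0.5), "t2": (0.9, 0.1)}
best_equal_cost = select_test(probs, {"t1": 1.0, "t2": 1.0})
best_costly_t1 = select_test(probs, {"t1": 3.0, "t2": 1.0})
```

With equal costs the evenly splitting test is preferred; once that test is made three times as expensive, the less informative but cheaper test yields more information per unit cost.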

In general, $P(A = \Theta(L; F_1, F_2, \ldots, F_L; G))$, needed in
the evaluation of information gain, can be computed as
follows:

$$P(A) = P\big((\cup_{i=1}^{m} \Psi_i) \cap \bar{G}\big) \qquad (4)$$

where $m$ is the number of minimal candidates, $\Psi_i$ for
$i = 1, \ldots, m$ are the minimal candidates of the ambiguity
group $A$, and the overbar denotes the logical NOT operator [20][23].
In addition, for notational clarity, we use the same notation
for expressing a set and its Boolean expression.
Furthermore, we use $\cap$ and $\cup$ as Boolean product and
sum, i.e., conjunction (AND) and disjunction (OR) of at
least two Boolean expressions, respectively [20]$^3$. We define
$G_s \subseteq G$ as a set containing single failures of $G$, and
$G_w$ as its complement within $G$, i.e., $G_w = G_s^c = G - G_s$.
Thus, by expanding (4) and using the associative law of
Boolean algebra [21], we have:

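$$P(A) = P\big((\cup_{i=1}^{m} \Psi_i) \cap \bar{G}_s \cap \bar{G}_w\big) = P\big(((\cup_{i=1}^{m} \Psi_i) \cap \bar{G}_s) \cap \bar{G}_w\big) \qquad (5)$$

By defining $\Phi_i = \Psi_i \cap \bar{G}_s$ for $i = 1, \ldots, m$, (5) reduces to:

$$P(A) = P\big((\cup_{i=1}^{m} \Phi_i) \cap \bar{G}_w\big) \qquad (6)$$

This further reduces to:

$$P(A) = P(\cup_{i=1}^{m} \Phi_i) - P\big((\cup_{i=1}^{m} \Phi_i) \cap G_w\big) = P(\cup_{i=1}^{m} \Phi_i) - P\big(\cup_{i=1}^{m} \cup_{w_j \in G_w} (\Phi_i \cap w_j)\big) \qquad (7)$$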

Since $\Phi_i = \Psi_i \cap \bar{G}_s$, the second term of (7) should be
considered only for those $w_j$ that do not have anything
in common with $G_s$. Further, it is sufficient to evaluate
the second term of (7) for $w_j \in I(G_w)$, i.e., the irreducible
set of $G_w$. This is because, if $w_k \subset w_j$, any set satisfying
the Boolean expression $(\Phi_i \cap w_j)$ will satisfy $(\Phi_i \cap w_k)$.

The problem in (7) is equivalent to the problem of 
finding the probability of a sum of non-disjoint sets. 
This problem is known as the sums of products prob- 
lem, and its computational complexity is NP-hard [22]. 
Veeraraghavan et al. [23] considered the sum (product)
of products (sums) problem and proposed an efficient 
Boolean algebraic algorithm, the so-called GKG-VT al- 
gorithm, for its solution. In this algorithm, the probabil- 
ity of the union of a set of events can be evaluated using 
the following equation [23]: 

3 This is in contrast to $\vee$ and $\wedge$, which are used to denote
the union and intersection of two sets.
$$P\{\cup_{i=1}^{p} IP_i\} = P\{(IP_1) \cup (\overline{IP_1}\, IP_2) \cup \cdots \cup (\overline{IP_1} \cdots \overline{IP_{p-1}}\, IP_p)\} \qquad (8)$$

Since these resulting sets are disjoint, their probabilities
are added to obtain the probability of the desired event.
Thus, the first term in (7) can be evaluated as follows:

$$P(\cup_{i=1}^{m}\Phi_i) = P(\cup_{i=1}^{m}\Psi_i \mid \bar{G}_s)\, P(\bar{G}_s) = P(\cup_{i=1}^{m}\Psi_i \mid \bar{G}_s) \prod_{s_k \in G_s} (1 - p(s_k)) \quad (9)$$

where $P(\cup_{i=1}^{m}\Psi_i \mid \bar{G}_s)$ is the probability of the sum of disjoint products $\Psi_i$ in the $S - G_s$ domain, i.e., the number of total variables reduces from $m$ to $m - |G_s|$. In addition, in digraph models without AND nodes, $G_w = \emptyset$. Thus, the second term in (7) is zero, and $P(A)$ reduces to (9).
Furthermore, in these systems, since the set of minimal 
candidates is the minimal hitting set for the set $F = \{F_1, F_2, \ldots, F_L\}$ (see Lemma 1), (9) can be written as:

$$P(A) = P(\cap_{i=1}^{L} F_i \mid \bar{G}_s) \prod_{s_k \in G_s} (1 - p(s_k)) \quad (10)$$

Thus, in systems having no redundancy, instead of 
evaluating the probability of ambiguity group A using 
minimal candidates at each stage of diagnosis, we can directly evaluate this probability using $F = \{F_1, \ldots, F_L\}$. Furthermore, since the GKG-VT algorithm evaluates the probability of $\{\cup_{i=1}^{p+1} IP_i\}$ (sum of products) and $\{\cap_{i=1}^{p+1} IP_i\}$ (product of sums) sequentially, i.e.,

$$P\{\cup_{i=1}^{p+1} IP_i\} = P\{\cup_{i=1}^{p} IP_i\} + P\{(\overline{IP}_1 \cdots \overline{IP}_p\, IP_{p+1})\} \quad (11)$$

$$P\{\cap_{i=1}^{p+1} IP_i\} = P\{\cap_{i=1}^{p} IP_i\} - P\{(IP_1 \cdots IP_p\, \overline{IP}_{p+1})\} \quad (12)$$

the probabilities of $A_{jf} = \Theta(L+1; F_1, \ldots, F_L, TS_j \wedge G^c; G)$ and $A_{jp} = \Theta(L; (F_1 \wedge \overline{TS}_j), \ldots, (F_L \wedge \overline{TS}_j); (G \vee TS_j))$ are:

$$P(A_{jf}) = \Big(P(\cap_{i=1}^{L} F_i \mid \bar{G}_s) - P(F_1 \cdots F_L\, \overline{TS_j \wedge G^c} \mid \bar{G}_s)\Big) \prod_{s_k \in G_s} (1 - p(s_k))$$

$$= P(A) - P(F_1 \cdots F_L\, \overline{TS_j \wedge G^c} \mid \bar{G}_s) \prod_{s_k \in G_s} (1 - p(s_k))$$

$$P(A_{jp}) = P(A) - P(A_{jf}) = P(F_1 \cdots F_L\, \overline{TS_j \wedge G^c} \mid \bar{G}_s) \prod_{s_k \in G_s} (1 - p(s_k))$$

Note that when $s_0 \in F_i$ (see the fifth property of compact set notation), we should split $F_i$ into two disjoint sets $\{s_0\}$ and $F_i - \{s_0\}$.

One of the advantages of this approach, compared to the one in [24], [25], is that the probability of an ambiguity group at the current stage of diagnosis is evaluated using the probability of the ambiguity group at the previous stage. Furthermore, using this recursive approach, the probability of any hypothesis at the current stage of diagnosis can be evaluated. The computational complexity of this approach is strongly related to the structure of the B matrix.

In summary, at each stage of diagnosis, for a given set $G_s$, we generate the set of disjoint events for $F_1 F_2 \cdots F_L\, \overline{TS_j \wedge G^c}$, evaluate $P(A_{jf})$ and $P(A_{jp})$, and recommend a test with the highest information gain. Based on the test outcome, we update the set of available tests TA, i.e., we remove the recommended tests and those tests that do not give any information. This procedure is continued until: (1) at least one failure source is isolated, or (2) no test gives further information.

The former corresponds to the case when the cardinality of one or more $F_i$ in the ambiguity group $A$ is one. After repairing/replacing the failure sources in the $F_i$'s with cardinality one, we update the current ambiguity group and the set of available tests as follows: (i) add repaired/replaced components to the set of good components $G$, (ii) remove the $F_i$'s containing at least one repaired/replaced component from ambiguity group $A$. If all the $F_i$'s are removed, we set the current ambiguity group to $A = \Theta(1; F_1 = S - G; G)$, and (iii) update the set of available tests TA to all tests except previously applied passed tests and those tests that do not give any new information, i.e., those tests $t_j$ such that $TS_j \wedge G^c$ is either an empty set, or a superset of one of the $F_i$ (see the second property of compact set notation). This procedure is continued until the set of good components $G$ contains all the elements.

In the second case, we can select either block or sequential replacement. In block replacement, we repair all the suspected faults, i.e., $S - G$, and stop testing. In sequential replacement, we repair/replace the most likely candidates and continue testing. The problem of finding the most likely candidates is as follows:

$$\text{maximize} \quad \prod_{i=1}^{m} p(s_i)^{x_i} (1 - p(s_i))^{(1 - x_i)} \quad (13)$$

subject to

$$\sum_{i=1}^{m} \Gamma_{ki}\, x_i \geq 1; \quad k = 1, \ldots, L \quad (14)$$

$$\sum_{i=1}^{m} \Omega_{ki}\, x_i \leq |w_k| - 1; \quad w_k \in G_w \quad (15)$$

$$x_i \in \{0, 1\}; \quad i = 1, \ldots, m \quad (16)$$

where $\Gamma_{ki} = 1$ if $s_i \in F_k$; otherwise $\Gamma_{ki} = 0$; $\Omega_{ki} = 1$ if $s_i \in w_k$; otherwise $\Omega_{ki} = 0$. Constraints (14) and (15) ensure that the most likely candidates are consistent with the failed and passed tests. Furthermore, using Lemmas 1 and 6, it is sufficient to consider (15) for $G_w = \{w_k \mid w_k \in I(G_w) \text{ and } w_k \wedge G_s = \emptyset\}$, where $I(G_w)$ is the irreducible set of $G_w$. By taking the logarithm of the objective function in (13), an equivalent objective function is:

$$\text{maximize} \quad \sum_{i=1}^{m} x_i \log \frac{p(s_i)}{1 - p(s_i)} \quad (17)$$

subject to (14), (15) and (16). Thus, the problem of 
finding the most likely candidates for redundant systems 
reduces to a generalized set-covering problem [26], [27]. 
Furthermore, in systems having no redundancy, $G_w = \emptyset$. Consequently, (15) is eliminated, and the problem reduces to a traditional set-covering problem [31].
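For small instances, the covering problem can be solved by exhaustive search, which is useful as a reference when checking a relaxation-based solver. The following sketch assumes the formulation above: every failed-test set $F_k$ must be hit, no minimal fault $w_k \in G_w$ may be fully contained in the candidate, and the score is the log-odds sum of (17). The function name and the argument layout are illustrative, and the sketch requires $0 < p(s_i) < 1$.

```python
from itertools import combinations
from math import log


def most_likely_candidate(p, F, G_w=()):
    """Brute-force the most likely fault set consistent with the test outcomes.

    p:   dict of failure source -> a priori failure probability (0 < p < 1).
    F:   list of sets; each must contain at least one chosen source, as in (14).
    G_w: collection of sets; none may be fully chosen, as in (15).
    Returns the best set of sources, or None if no candidate is feasible.
    """
    sources = sorted(p)
    best, best_score = None, float("-inf")
    for r in range(len(sources) + 1):
        for chosen in combinations(sources, r):
            X = set(chosen)
            if any(not (X & Fk) for Fk in F):
                continue  # some failed test is left unexplained
            if any(set(wk) <= X for wk in G_w):
                continue  # contradicts a passed test on a minimal fault
            score = sum(log(p[s] / (1.0 - p[s])) for s in X)
            if score > best_score:
                best, best_score = X, score
    return best
```

The search is exponential in $m$, so it only serves as a check on toy problems; it is not a substitute for the relaxation approach described next.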

The generalized set covering problem is solved via a 
Lagrangian relaxation technique. In this technique, the 
constraints (14) and (15) are relaxed via Lagrange mul- 
tipliers. The solution of the relaxed problem is an up- 
per bound for the covering problem. The multipliers are 
updated iteratively via subgradient optimization to min- 
imize this upper bound. In addition, the upper bound 
and the relaxed solution can be used to develop the best 
feasible solution for the generalized set covering problem. 
A nice feature of the relaxation approach is that the dif- 
ference between the upper bound and the best feasible 
solution, termed the approximate duality gap, provides 
a measure of suboptimality of the feasible solution. The 
details may be found in [30, 26, 27].
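A minimal sketch of the relaxation idea follows, restricted to the traditional covering constraints (14) for brevity; it is not the implementation of [30, 26, 27]. Relaxing (14) with multipliers $\lambda_k \geq 0$ makes the problem separable, so each $x_i$ is set to one exactly when its reduced gain is positive, and the resulting value bounds the optimum of (17) from above. The harmonic step size and the function name are assumptions made for illustration.

```python
def lagrangian_upper_bound(c, F, iterations=200):
    """Smallest Lagrangian upper bound found for: max sum(c[s] for s in x)
    subject to x hitting every set in F.

    c: dict of failure source -> objective coefficient, e.g. log(p / (1 - p)).
    F: list of sets of failure sources (one covering constraint per set).
    """
    lam = [0.0] * len(F)
    best = float("inf")
    for it in range(iterations):
        # Reduced gain of each source under the current multipliers.
        gain = {s: c[s] + sum(l for l, Fk in zip(lam, F) if s in Fk) for s in c}
        # The relaxed problem separates: choose a source iff its gain is positive.
        x = {s for s, g in gain.items() if g > 0}
        bound = sum(gain[s] for s in x) - sum(lam)
        best = min(best, bound)
        # Subgradient of the bound with respect to lam_k is |x & F_k| - 1.
        step = 1.0 / (it + 1)
        lam = [max(0.0, l - step * (len(x & Fk) - 1)) for l, Fk in zip(lam, F)]
    return best
```

Comparing this bound with the value of any feasible candidate gives the approximate duality gap mentioned above.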

Using the solution of the generalized set-covering problem, the most likely candidate is $S_m = \{s_i \mid x_i = 1\}$. Note that, usually, $p(s_i) < 0.5$ for $i = 1, \ldots, m$. In this special case, the most likely candidate is one of the minimal
candidates of the current ambiguity group. After repair- 
ing the most likely candidates, we update the ambiguity 
group and continue testing. 

In order to solve multiple fault isolation problems in 
larger systems with as many as 10,000 failure sources, 
we employ the following simplified approach to compute 
information gain in (3). If the ambiguity group $A$ at the current stage of diagnosis contains more than one single failure source, i.e., the intersection of the $F_i$'s contains more than one fault, we select a test $t_j$ that maximizes the information gain per unit cost of the test based on a single fault ambiguity group. That is, in this case, $IG(A, t_j)$ in (2) is the information gain based on the single fault assumption, and $p'(A_{jp})$ and $p'(A_{jf})$ in (3) are the conditional probabilities of the pass and fail outcomes of test $t_j$ based on the single fault assumption in ambiguity group $A$. However, based on the test outcome, we update the multiple fault ambiguity group (see properties 1 and 2 of the compact set notation). This procedure is continued until: (1) at least one failure source is isolated, i.e., the cardinality of one or more of the $F_i$'s is one, (2) no test gives further information, (3) the cardinality of the intersection of the $F_i$'s is one, i.e., there exists a set of masking sets for the single fault in that intersection, or (4) the set of good components $G$ contains all the components, $G = S$.

The first and second cases are the same as those in the 
previous approach. In the third case, we recommend a 
test based on a measure of information content in [28], 
and continue testing until the first or second condition is 
reached. In the fourth case, since all the components are 
good, no further action is needed. This approach has less 
computational complexity, but higher testing cost com- 
pared to the previous approach, based on sum (product) 
of disjoint products (sums). 

In addition to a set of comprehensive synthetic prob- 
lems, we have applied the algorithms presented in this 
paper to several real-world systems. These include: (1) 
the Space Shuttle Main Propulsion System with 7271 
failure sources and 1292 AND nodes [6], (2) the F18 flight control system with 148 failure sources and 78 AND nodes [29], with failure sources limited to singletons and doubletons, (3) the anticollision light control system
of the Sea Hawk helicopter with 51 failure sources and 55 
tests, (4) the stabilator system of the Black Hawk heli- 
copter with 238 failure sources and 834 tests, and (5) the 
engine torque monitoring system used in CH-53E heli- 
copter with 116 failure sources and 75 tests. In the latter 
three cases, static and dynamic multiple fault diagnos- 
tic strategies subject to various constraints on available 
resources, setup operations, and initial failure symptoms 
have been implemented, along with interfaces to interac- 
tive electronic technical manuals and multi-media docu- 
mentation. 

8 Conclusion 

In this paper, we considered the problem of constructing 
near-optimal test sequencing algorithms for diagnosing 
multiple faults in systems modeled as digraphs. This 
problem involves two sequential steps: (1) generation of 
a binary test matrix, and (2) design of a multiple-fault 
testing strategy that unambiguously isolates the multiple 
failures with minimum expected testing cost (time). 

In systems without redundancy, a binary test matrix 
denoting the full-order dependency among single failures 
and the tests forms the basis for diagnosing single, as well 
as multiple faults in the system. In order to diagnose 
multiple faults in systems with redundancy, this binary 
test matrix is augmented to capture the failure signatures 
of minimal-faults. Using a top-down recursive procedure, 
we developed an algorithm to find all the minimal faults
and their failure signatures in redundant systems, and 
using a bottom-up procedure, we presented an efficient 
algorithm to find minimal faults up to a limited size. 

After generating the binary test matrix, the problem is 
to design a practical multiple fault test sequencing algorithm. The computational and storage complexities of an optimal multiple fault strategy are super-exponential in the number of failure sources, m. We presented several
near-optimal algorithms that provide a trade-off between 
optimality and computational complexity. Firstly, we 
extended the single-fault strategy of our previous work 
[1, 10] to diagnose multiple faults by successively isolating 
the potential single-fault candidates, then double-fault 
candidates, and so on. This is one of the simplest mul- 
tiple fault strategies that one can use. In this approach, 
the storage complexity at each OR node of the AND/OR 
graph is the same as that in a single fault strategy. 

We then extended the single fault sequential testing 
strategies to a class of Sure strategies. The basic idea 
of these strategies is to find one or more definitely failed 
components, while not making an error when other co- 
existing faults are present. We explored three different 
approaches for the application of additional tests, resulting in the Sure1, Sure2, and Sure3 strategies.

Some of the advantages of using Sure strategies are: 

(1) the inherent combinatorial explosion that occurs in 
generating an optimal multiple fault strategy is reduced 
substantially, (2) the first iteration of the Sure strategies 
results in the same tree as in the single fault (minimal 
fault) strategy for the system without (with) redundancy, 
and therefore, these strategies isolate a single fault (mini- 
mal fault) with the smallest average cost, while not mak- 
ing an error when multiple faults are present. Computa- 
tional complexity of this approach is strictly related to 
the structure of the system, i.e., the test matrix B. 

In order to eliminate the problems associated with the 
size of the complete diagnostic strategy, the test strategy 
can be generated "on-line". That is, instead of generating the entire diagnostic tree, the interactive strategy
only suggests the next test to be applied given the out- 
comes of previously applied tests. We employed concepts 
from information theory and Lagrangian relaxation to 
generate several on-line diagnostic strategies. In these 
strategies, at each stage of diagnosis, a test with the 
highest information gain is recommended. The compu- 
tation of information gain associated with a test requires 
the probabilities of ambiguity groups corresponding to 
pass and fail outcomes of the test. An efficient computational approach based on sum (product) of disjoint products (sums) is used to evaluate these probabilities. However, the computational complexity of this approach is strongly related to the structure of the binary test matrix B and previously applied tests. In order to derive a practical (albeit suboptimal) on-line diagnostic strategy capable of diagnosing multiple faults in large scale systems, we estimated these probabilities via: (1) the probabilities of single failures at the ambiguity group, i.e., $\bigwedge_{1 \leq i \leq L} F_i$, and (2) the probability of ambiguity group based on all the suspected faults, i.e., $\Theta(1; F_1 = \{S - G\}; G)$ [28].
Note that, these estimates constitute the lower and up- 
per bounds for the probability of ambiguity group. We 
expect to investigate tighter bounds, as well as other 
measures for recommending a test, and compare their 
efficiencies in our future efforts. 
Problem Formulation



We consider the problem of sequencing tests to isolate mul- 
tiple faults in redundant (fault-tolerant) systems with mini- 
mum expected testing cost (time). It can be shown that single 
faults and minimal faults, i.e., minimum number of failures 
with a failure signature different from the union of failure 
signatures of individual failures, together with their failure 
signatures, constitute the necessary information for fault di- 
agnosis in redundant systems. In this paper, we develop an 
algorithm to find all the minimal faults and their failure sig- 
natures. Then, we extend the Sure diagnostic strategies [1] 
of our previous work to diagnose multiple faults in redundant 
systems. The proposed algorithms and strategies are illus- 
trated using several examples. 

1 Introduction 

Diagnosis is fundamentally a process of identifying the 
cause of a malfunction by observing its effects at various monitoring points in a system. Fault diagnosis in large-scale systems that are products of modern technology presents formidable challenges to manufacturers and users. This is due to the large number of failure sources
and the need to quickly isolate and rectify such failures 
with minimal down time. In addition, for redundant sys- 
tems and systems with little or no opportunity for repair 
or maintenance during the operation (e.g., Hubble tele- 
scope, space station), the assumption of at most a single 
failure in the system between consecutive maintenance 
actions is unrealistic.

In this paper, we consider the problem of construct- 
ing test sequencing algorithms for diagnosing multiple 
faults in redundant systems. Our approach is to: (1) 
generate all minimal faults and their failure signatures 
in the system, and (2) extend the multiple fault sequen- 
tial testing strategies of our previous work [1] to fault- 
tolerant systems. In addition, the minimal fault analy- 
sis can be used for a quantitative evaluation of system 
dependability^, 4]. 
We assume that the system is modeled by a directed graph (digraph) $DG = \{S, T, A, E\}$, where $E$ denotes the set of directed edges specifying the functional information flow in the system, and

• $S = \{s_1, \ldots, s_m\}$ is a finite set of independent failure sources associated with the system;

• $T = \{t_1, t_2, \ldots, t_n\}$ is a finite set of $n$ available binary outcome tests, where the integrity of system failure sources/components/modules can be ascertained;

• $A = \{a_1, \ldots, a_K\}$ is a finite set of AND nodes representing system redundancies.

The input requirements of the various nodes of the digraph model are as follows:

1. Failure node: A priori probability vector of failure nodes $P = [p(s_1), \ldots, p(s_m)]$, where $p(s_i)$ is the a priori probability of failure source $s_i$.

2. Test node: A set of test costs $C = \{c_1, c_2, \ldots, c_n\}$, where $c_j$ is the cost of applying test $t_j$.

3. AND node: Two sets $F = \{f_1, \ldots, f_K\}$ and $G = \{g_1, \ldots, g_K\}$, where $f_k$ and $g_k$ denote the $f_k$-out-of-$g_k$ logic for AND node $a_k$, i.e., AND node $a_k$ has $g_k$ inputs and a failure must occur in at least $f_k$ inputs of this AND node for the faults to propagate to the output.

The problem is to design a testing strategy that un- 
ambiguously isolates the failure sources with minimum 
expected testing cost. The sequential test strategy is 
represented in the form of an AND/OR decision tree, 
where the OR nodes represent the suspect sets of fail- 
ure sources, AND nodes are tests applied at various OR 
nodes, and the leaves are the isolated failure sources. 

3 Minimal Fault Algorithm 

In digraph models without AND nodes, i.e., having no re- 
dundancy, the test matrix (fault dictionary) denotes the 
full-order dependency among single failures and tests in 
the system. Assuming that the failure signature of a multiple failure is the union of failure signatures of individual
failures, the binary test matrix forms the knowledge base 
for diagnosing single faults, as well as multiple faults, in 
a system having no redundancy. However, this property 
is not valid for a digraph model with AND nodes. It 
can be shown that single faults and minimal faults, i.e., 
minimum number of failures with a failure signature dif- 
ferent from the union of failure signatures of individual 
failures, together with their failure signatures, contain all 
the necessary information for fault diagnosis in digraph 
models with AND nodes [10]. 

In order to generate minimal faults and their failure 
signatures, we compute the following dependency matri- 
ces using the reachability analysis algorithm [2, 6]: 

• Failure source-test dependency matrix $D = [d_{ij}]$ is a binary matrix of dimension $m \times n$, where $d_{ij} = 1$ if $t_j$ monitors failure source $s_i$; otherwise, $d_{ij} = 0$;

• Failure source-AND node dependency matrix $B = [b_{ij(l)}]$ is a binary matrix of dimension $m \times z$, where $z = \sum_{j=1}^{K} (g_j + 1)$; $b_{ij(0)} = 1$ if a failure of $s_i$ can reach the output of AND node $a_j$; otherwise, $b_{ij(0)} = 0$; and $b_{ij(l)} = 1$ if a failure of $s_i$ can reach the $l$th input of AND node $a_j$; otherwise, $b_{ij(l)} = 0$;

• AND node-test dependency matrix $E = [e_{ij}]$ is a binary matrix of dimension $K \times n$, where $e_{ij} = 1$ if $t_j$ monitors AND node $a_i$; otherwise, $e_{ij} = 0$;

• AND node-AND node dependency matrix $R = [r_{ij(l)}]$ is a binary matrix of dimension $K \times z$, where $r_{ij(0)} = 1$ if a failure at the output of AND node $a_i$ can reach the output of AND node $a_j$; otherwise, $r_{ij(0)} = 0$; and $r_{ij(l)} = 1$ if a failure at the output of AND node $a_i$ can reach the $l$th input of AND node $a_j$; otherwise, $r_{ij(l)} = 0$;

• AND node-AND node reachability matrix $Q = [q_{ij}]$ is a binary matrix of dimension $K \times K$, where $q_{ij} = 1$ if there is at least one path between $a_i$ and $a_j$; otherwise, $q_{ij} = 0$.

Note that we can generate the AND node-AND node reachability matrix $Q$ by setting each AND node's logic of $f_k$-out-of-$g_k$ to 1-out-of-$g_k$ in the reachability analysis algorithm, i.e., AND nodes devolve into OR nodes.

For convenience, given a binary matrix $X = [x_{ij}]$ of dimension $k_1 \times k_2$, we define $Xr_i$ for $i = 1, \ldots, k_1$ as its $i$th row, and $Xc_j$ for $j = 1, \ldots, k_2$ as its $j$th column. For example, $Dr_i$ for $i = 1, \ldots, m$ lists all the tests that can detect failure source $s_i$, and $Dc_j$ for $j = 1, \ldots, n$ indicates all failure sources detectable by test $t_j$.


Using these matrices, the minimal fault algorithm finds 
the minimal faults of the digraph model via the following 
steps: 

1. Sort a subset of AND nodes $A_s \subseteq A$ to be processed.

2. Generate minimal combinations of AND nodes and failure nodes that propagate to the output of every AND node $a_k \in A_s$, i.e., $MC(a_k)$ for $a_k \in A_s$.

3. Generate minimal faults of each AND node in $A_s$, i.e., $MF(a_k)$ for $a_k \in A_s$.

4. Generate the minimal faults of the digraph model ($MF_d$) using the minimal faults of AND nodes.

3.1 Step 1 - Sorting the AND nodes 

Since we need to process only those AND nodes $A_s \subseteq A$ for which there exists at least one path to a test, the algorithm sorts the AND nodes in $A_s$ such that an AND node will be processed before any other AND nodes reachable from it. This step prevents the algorithm from performing similar operations repeatedly.

The procedure for finding and sorting $A_s$ is as follows: (1) Find a subset of AND nodes $A_e \subseteq A$ such that an AND node $a_k \in A_e$ can be detected by at least one test, i.e., $A_e = \bigcup_{j=1}^{n} Ec_j$, (2) Using the AND node-AND node reachability matrix $Q$, find the subset of AND nodes $A_s$ that reach $A_e$, i.e., $A_s = \bigcup_{a_k \in A_e} Qc_k$, and (3) Sort the AND nodes in $A_s$ in ascending order of the number of AND nodes reaching them. Note that the number of AND nodes reaching AND node $a_j$ is $\sum_{k=1}^{K} q_{kj}$.
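The three sub-steps of Step 1 can be sketched as follows. This is an illustrative reading, assuming $E$ and $Q$ are given as 0/1 nested lists indexed from zero and that the monitored nodes $A_e$ themselves belong to $A_s$ even when the diagonal of $Q$ is zero; ties in the sort are broken by node index, which the text does not specify.

```python
def sort_and_nodes(E, Q):
    """Return the indices of the AND nodes to process, in processing order.

    E: K x n 0/1 matrix, E[i][j] == 1 if test j monitors AND node i.
    Q: K x K 0/1 matrix, Q[i][j] == 1 if there is a path from AND node i to j.
    """
    K = len(E)
    # (1) A_e: AND nodes monitored by at least one test (non-zero row of E).
    A_e = {i for i in range(K) if any(E[i])}
    # (2) A_s: A_e together with every AND node that reaches a node in A_e.
    A_s = set(A_e)
    for k in A_e:
        A_s |= {i for i in range(K) if Q[i][k]}

    # (3) Ascending order of the number of AND nodes reaching each node,
    #     i.e. the column sums of Q.
    def reach_count(j):
        return sum(Q[i][j] for i in range(K))

    return sorted(A_s, key=lambda j: (reach_count(j), j))
```

Nodes with no path to any monitored node are dropped, so later steps never expand them.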

3.2 Step 2 - Generation of minimal combina- 
tions for each AND node 

Using the binary matrices, the minimal fault algorithm generates minimal combinations of AND nodes and failure nodes for each AND node in $A_s$, i.e., $MC(a_k)$ for $a_k \in A_s$. The procedure is as follows: (1) For each AND node $a_k$, determine the failure sources and AND nodes that can reach the inputs and output of $a_k$, i.e., $Sc_k(l)$ for $l = 0, 1, \ldots, g_k$, where $Sc_k(l) = Bc_k(l) \cup Rc_k(l)$, (2) Remove $Sc_k(0)$ from $Sc_k(l)$ for $l = 1, \ldots, g_k$, that is, remove single failures affecting the AND node output from its input signatures. (3) Because of the $f_k$-out-of-$g_k$ logic, all combinations of $Sc_k(l)$ for $l = 1, \ldots, g_k$ containing sets of cardinality $f_k$ are considered. For example, for an AND node, say $a_k$, with 2-out-of-3 logic, we consider the following combinations: $(Sc_k(1), Sc_k(2))$, $(Sc_k(1), Sc_k(3))$ and $(Sc_k(2), Sc_k(3))$. Then, using the combinations of the $Sc_k(l)$'s, generate the minimal combinations of AND nodes and failure nodes for each AND node by one of the following three approaches: (a) Minimal hitting set (see footnote 1) method using a breadth first search [7], (b) Minimal hitting set method using a depth first search, and (c) Binary decision diagrams [8]. When $g_k = 2$ for $k = 1, \ldots, K$, the minimal combinations for each AND node can be found using the cross-product of two sets $Sc_k(1)$ and $Sc_k(2)$ for $k = 1, \ldots, K$ [9].

Note that we can consider a limit L for the number 
of failure sources in the minimal combinations of AND 
nodes. In the first and second approaches (i.e., (a) and 
(b)), those combinations with more than L failure sources 
are not expanded. In the third approach (i.e., (c)), at 
first the decision diagram is generated, and then, the 
combinations with more than L faults are eliminated. 
However, because of the small number of sets, i.e., $g_k$ for $k = 1, \ldots, K$, usually 2 or 3, there is almost no difference
in using any of these three approaches. 
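As an illustration of sub-step (3), the sketch below enumerates the $f_k$-out-of-$g_k$ choices of inputs, takes the cross-product of the chosen input sets, and prunes supersets. It is a brute-force stand-in for the three approaches listed above, not an implementation of any of them, and it omits the limit $L$; the function name is an assumption.

```python
from itertools import combinations, product


def minimal_combinations(input_sets, f):
    """Minimal sets of nodes whose failure reaches at least f inputs of an AND node.

    input_sets: list of sets, one per input; each holds the failure sources and
                AND nodes that can reach that input (output-affecting ones removed).
    f:          number of inputs that must fail (f-out-of-g logic).
    """
    candidates = set()
    for chosen in combinations(input_sets, f):
        # Pick one reaching node per chosen input; duplicates collapse in the set.
        for pick in product(*chosen):
            candidates.add(frozenset(pick))
    # Keep only combinations that have no proper subset among the candidates.
    return {c for c in candidates if not any(o < c for o in candidates)}
```

A node that reaches several inputs naturally yields a smaller combination, which is why the superset pruning is needed after the cross-product.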

3.3 Step 3 - Minimal faults for each AND node 

After generating minimal combinations for each AND 
node in As, we process one AND node at a time. The 
subroutine for this part is a recursive function, and for 
simplicity, we call it MFG (Minimal Fault Generator): 
MFG(AND-node, fault-list, AND-list, level, solved-AND-nodes). In order to find minimal faults for an AND node, say $a_i$, we call MFG as follows: MFG($a_i$, $\emptyset$, $\emptyset$, 0, $\emptyset$). MFG adds $a_i$ to AND-list, and considers it as a level-zero AND node. Then, if the minimal faults of this AND node have already been found, MFG adds one of $a_i$'s minimal faults to the fault-list. Otherwise, it adds one of $a_i$'s minimal combinations to the fault-list, removes solved-AND-nodes from that list, and sets the level of the new AND nodes in the fault-list to level+1. Then, it removes the AND node with the highest level, say $a_j$, from the fault-list, and calls MFG as follows: MFG($a_j$, fault-list, AND-list, level+1, solved-AND-nodes). This procedure is continued until: (a) no AND node remains in the fault-list, or (b) the level of the selected AND node is less than or equal to the level of one or more AND nodes in the AND-list, or (c) the algorithm picks an AND node that has already been processed, i.e., the AND node is in the AND-list.

In the first case, this combination is compared with other combinations created for AND node $a_i$. If it is a super set of one of them, MFG does not do anything, and returns. If it is a subset of one or more of them, the algorithm removes the super sets, and it stores this combination, as well as the set of AND nodes affected by this combination, i.e., AND-list without considering levels, and returns.

1 A hitting set for a collection of sets $C$ is a set $H \subseteq \bigcup_{X \in C} X$ such that $H \cap X \neq \emptyset$ for each $X \in C$.

In the second case, the algorithm adds the AND nodes 
with levels greater than or equal to the level of the se- 
lected AND node in the AND-list to the solved-AND- 
nodes, resets their levels to zero, and removes these AND 
nodes from the fault-list, if any. Then, it processes the 
selected AND node. 

In the third case, if the AND node has already been solved, i.e., it belongs to the solved-AND-nodes, the algorithm removes the AND node from the fault-list, if present, and picks another AND node and processes it. Otherwise, the algorithm returns without doing anything. This latter step prevents the algorithm from entering an infinite loop when a cycle is encountered.

Note that we can consider a limit L for the number of 
faults in the minimal faults of AND nodes. In this case, 
MFG checks whether the number of faults in the fault-list 
is greater than the limit L or not. If the number of faults 
in the fault-list is greater than L, it returns; otherwise, 
it expands the AND node as before. 

3.4 Step 4 - Minimal faults of digraph models 

After generating the minimal faults of the AND nodes, 
the algorithm generates the minimal faults of the di- 
graph model (MFd). Firstly, using the AND node- test 
dependency matrix E, the algorithm removes the mini- 
mal faults of those AND nodes that cannot be detected 
by any test. Secondly, if a set of faults belongs to more 
than one minimal combination of AND nodes, the algo- 
rithm considers only one of them, and stores the union of 
corresponding AND-list as the set of AND nodes affected 
by the set of faults. Then, using matrices D, E and AND-list, the algorithm generates the failure signatures of the remaining faults, i.e., $\bigcup_{s_i \in w_j} Dr_i \cup \bigcup_{a_k \in \text{AND-list}} Er_k$, where $w_j$ is a minimal fault, and AND-list is the list of AND nodes affected by $w_j$. Note that the remaining faults may contain super sets, and because of the test points in the digraph model, a super set may or may not be a minimal fault of the digraph model. Those super sets which have the same failure signatures as the union of the failure signatures of their subsets are removed. For example, let us consider a digraph model with failure sources $S = \{s_1, s_2, s_3\}$, AND nodes $A = \{a_1, a_2\}$ and tests $T = \{t_1, t_2\}$. A failure of $s_1$ and $s_2$ can be propagated to the output of the first AND node, i.e., $a_1$, and can be detected by test $t_1$. A failure of $s_1$, $s_2$ and $s_3$ propagates to the output of the second AND node, i.e., $a_2$, and is detected by test $t_2$. Therefore, $\{s_1, s_2\}$ and $\{s_1, s_2, s_3\}$ are minimal faults of AND nodes $a_1$ and $a_2$, respectively, and the minimal fault of $a_2$ is a super set of the minimal fault of $a_1$. However, because of the different failure signatures, i.e., $\{s_1, s_2\}$ is detected by $t_1$ and $\{s_1, s_2, s_3\}$ is detected by $t_1$ and $t_2$, these two sets are the minimal faults of the digraph model.
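The signature computation and the super set check can be sketched as below. The dictionaries `D` and `E` stand in for the rows $Dr_i$ and $Er_k$, and the pruning rule is one interpretation of the text: a super set is dropped when its signature equals the union of the signatures of its single failures and of its proper subsets in the fault list. All names are illustrative.

```python
def failure_signature(fault, and_nodes, D, E):
    """Union of the tests monitoring the failure sources in `fault` and the
    AND nodes in `and_nodes`. D and E map a node name to its set of tests."""
    sig = set()
    for s in fault:
        sig |= D.get(s, set())
    for a in and_nodes:
        sig |= E.get(a, set())
    return sig


def prune_redundant_supersets(faults, D, E):
    """faults: dict mapping frozenset(fault) -> set of affected AND nodes.
    Returns a dict of the kept faults -> their failure signatures."""
    sigs = {w: failure_signature(w, ands, D, E) for w, ands in faults.items()}
    kept = {}
    for w, sig in sigs.items():
        subsets = [v for v in sigs if v < w]
        # Signature explained by the individual failures and the smaller faults.
        explained = failure_signature(w, (), D, E)
        for v in subsets:
            explained |= sigs[v]
        if subsets and explained == sig:
            continue  # the super set adds no new test information
        kept[w] = sig
    return kept
```

With the example above ($a_1$ monitored by $t_1$, $a_2$ by $t_2$, no test on the individual sources), both faults survive because the larger one additionally fails $t_2$.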

In order to illustrate the minimal fault algorithm, we 
present the following examples. 

Example 1: Consider the digraph model in Figure 1. It consists of four failure sources $S = \{s_1, s_2, s_3, s_4\}$, three AND nodes $A = \{a_1, a_2, a_3\}$ and one test point $T = \{t_1\}$. $\{s_1, s_2, s_3, s_4\}$ is the minimal fault for this digraph model.



Figure 1: Digraph model of Example 1 

The minimal fault algorithm works as follows: 

• Step 1: The AND nodes are sorted as follows: $a_1$, $a_2$ and $a_3$.

• Step 2: The minimal combinations of failure sources and AND nodes for each AND node are as follows: $MC(a_1) = \{\{s_3, s_4\}\}$; $MC(a_2) = \{\{a_1, s_2\}, \{a_1, a_3\}\}$; $MC(a_3) = \{\{s_1, a_1, a_2\}\}$.

• Step 3: $\{s_3, s_4\}$, $\{s_2, s_3, s_4\}$ and $\{s_1, s_2, s_3, s_4\}$ are the minimal faults of AND nodes $a_1$, $a_2$ and $a_3$, respectively.

• Step 4: No test can detect $\{a_1, a_2\}$. Therefore, the minimal faults of these AND nodes should be eliminated. Thus, $\{s_1, s_2, s_3, s_4\}$ is the only minimal fault of the digraph model in Figure 1.
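Step 3 of Example 1 can be reproduced with a much simpler recursion than MFG. The sketch below substitutes each AND node in a minimal combination by its own minimal faults and abandons any branch that revisits an AND node already on the current expansion path; this is a simplification that replaces the level and solved-AND-nodes bookkeeping described earlier, so it should be read as an assumption-laden illustration rather than the MFG routine itself.

```python
from itertools import product


def minimal_faults(node, MC, path=()):
    """Expand the minimal combinations of `node` into sets of failure sources.

    MC:   dict mapping each AND node to a list of minimal combinations (sets of
          failure sources and AND nodes); any name not in MC is a failure source.
    path: AND nodes currently being expanded, used to cut cycles.
    """
    if node in path:
        return set()  # cycle: this branch cannot produce a fault
    results = set()
    for combo in MC[node]:
        sources = frozenset(x for x in combo if x not in MC)
        and_nodes = [x for x in combo if x in MC]
        expansions = [minimal_faults(a, MC, path + (node,)) for a in and_nodes]
        if any(not e for e in expansions):
            continue  # some AND node in the combination cannot be expanded
        for pick in product(*expansions):
            results.add(sources.union(*pick))
    # Discard super sets of other results.
    return {r for r in results if not any(o < r for o in results)}


# Minimal combinations from Step 2 of Example 1.
MC_EXAMPLE_1 = {
    "a1": [{"s3", "s4"}],
    "a2": [{"a1", "s2"}, {"a1", "a3"}],
    "a3": [{"s1", "a1", "a2"}],
}
```

On `MC_EXAMPLE_1`, the combination $\{a_1, a_3\}$ of $a_2$ is discarded because expanding $a_3$ leads back to $a_2$, which leaves the three minimal faults listed in Step 3.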

Example 2: Consider the digraph model of the F18 Flight Control System (FCS) for the left Leading Edge Flap (LEF) in Figure 2, which was used as an example in [3]. The minimal faults for this digraph model are {FCCA, FCCB}, {FCCA, CHNL3}, {FCCB, CHNL2}, and {CHNL2, CHNL3}.

Example 3: Consider the digraph model in Figure 3, which was used as an example in [9]. The minimal faults for this digraph model are $\{s_2, s_3\}$, $\{s_3, s_4\}$, $\{s_3, s_5\}$, and $\{s_9, s_{10}\}$.

3.5 Multiple Fault Strategy 

It can be shown that the computational and storage complexity of designing an optimal multiple-fault diagnostic strategy are exponential in $m$ [10]. In order to reduce the storage complexity, we use the compact multiple fault notation for the multiple fault ambiguity group at each OR node [1], [5]. Furthermore, in order to reduce computational complexity, we extend the Sure diagnostic strategies of our previous work [1] to redundant systems.

Figure 2: Digraph Model of F18 FCS LEF of Example 2

Figure 3: Digraph Model of Example 3

3.5.1 Compact Notation 

We use the compact multiple fault notation $A = \Theta(L; F_1, \ldots, F_L; G)$ for the multiple fault ambiguity group at each OR node in systems without AND nodes [1], [5]. The $F_i$ for $i = 1, \ldots, L$ and $G$ are subsets of $\tilde{S} = S \cup \{s_0\} = \{s_0, s_1, \ldots, s_m\}$, where $s_0$ is the fault-free condition; $G$ is the set of known good failure sources (failure free sources), and $F_i$ for $i = 1, \ldots, L$ are sets known to contain at least one definitely failed component each, i.e.,

$$\Theta(L; F_1, F_2, \ldots, F_L; G) = \{X \subseteq \tilde{S} \mid X \cap F_i \neq \emptyset \text{ for } i = 1, \ldots, L, \text{ and } X \cap G = \emptyset\}$$

In the following, we summarize some of the properties of compact set notation:

• If $E \supseteq F_i$ for some $i$, then $\Theta(L+1; F_1, F_2, \ldots, F_L, E; G) = \Theta(L; F_1, F_2, \ldots, F_L; G)$.

• $\Theta(L; F_1, F_2, \ldots, F_L; G) = \Theta(L; F_1 \cap G^c, F_2 \cap G^c, \ldots, F_L \cap G^c; G)$, where superscript $c$ denotes complement, i.e., $G^c = \tilde{S} - G$.

• The worst case storage complexity of compact set notation for an OR node is $O(mn)$.

• Multiple fault logic using the compact set notation is as follows: the initial hypothesis set is the set of all subsets of $\tilde{S}$, i.e., $A = \Theta(1; F_1 = \tilde{S}; G = \emptyset)$. After performing a test, say $t_j$, the hypothesis set $A = \Theta(L; F_1, \ldots, F_L; G)$ is pruned as follows:

$$A \leftarrow \begin{cases} \Theta(L; (F_1 \cap \overline{TS}_j), \ldots, (F_L \cap \overline{TS}_j); (G \cup TS_j)) & \text{if } t_j \text{ passes} \\ \Theta(L+1; F_1, \ldots, F_L, TS_j \cap G^c; G) & \text{if } t_j \text{ fails} \end{cases}$$

• The failure sources belonging to an $F_i$ with cardinality one are definitely faulty (one-for-sure).
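The pruning rule and the first property translate directly into a small data structure. The class below is a sketch for systems without AND nodes; the class and method names are assumptions, test signatures are passed as plain sets, and the caller decides whether the fault-free element $s_0$ is included in the source set.

```python
class AmbiguityGroup:
    """Compact representation Theta(L; F_1, ..., F_L; G) of a multiple fault
    hypothesis set, updated from test outcomes."""

    def __init__(self, sources):
        self.F = [frozenset(sources)]  # initial state: Theta(1; F_1 = S; G = {})
        self.G = frozenset()

    def _add(self, new_set):
        # Property 1: a super set of an existing F_i carries no information.
        if any(Fi <= new_set for Fi in self.F):
            return
        # Symmetrically, existing super sets of new_set become redundant.
        self.F = [Fi for Fi in self.F if not new_set <= Fi] + [new_set]

    def update(self, test_signature, passed):
        TS = frozenset(test_signature)
        if passed:
            # Everything the test covers is good; remove it from each F_i.
            self.G |= TS
            old, self.F = self.F, []
            for Fi in old:
                self._add(Fi - TS)
        else:
            # At least one not-yet-cleared source covered by the test has failed.
            self._add(TS - self.G)

    def contains(self, hypothesis):
        X = frozenset(hypothesis)
        return all(X & Fi for Fi in self.F) and not (X & self.G)

    def sure_faults(self):
        # One-for-sure: an F_i of cardinality one names a definitely failed source.
        return {next(iter(Fi)) for Fi in self.F if len(Fi) == 1}
```

Storing only the $F_i$ and $G$ keeps the state linear in the number of applied tests, in line with the $O(mn)$ worst case noted above, instead of enumerating the hypotheses themselves.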

In systems with AND nodes, each row of the test matrix corresponds to a subset of $S = \{s_1, \ldots, s_m\}$. For simplicity, let us assume that the new test matrix contains $m_n = m + m_f$ rows, where $m_f$ is the number of minimal faults. We define $W = \{w_1, \ldots, w_{m_n}\}$, where $w_i = \{s_i\}$ for $i = 1, \ldots, m$, and $w_i \subseteq S$ for $m + 1 \leq i \leq m_n$. Therefore, the $F_i$ for $i = 1, \ldots, L$ and $G$ are subsets of $\tilde{W} = W \cup \{w_0\} = \{w_0, w_1, \ldots, w_{m_n}\}$, where $w_0 = \{s_0\}$ and

$$\Theta(L; F_1, F_2, \ldots, F_L; G) = \{X \subseteq \tilde{W} \mid X \cap F_i \neq \emptyset \text{ for } i = 1, \ldots, L, \text{ and } X \cap G = \emptyset\}$$

Based on these definitions, it can easily be shown that:

Lemma 1: Using the one-for-sure condition, if the cardinality of any $F_i$ is one, all the failure sources in $w_j \in F_i$ are faulty, and if the cardinality of $F_i$ is greater than one, all the failure sources in $\bigcap_{w_j \in F_i} w_j$ are definitely faulty. These two conditions can be combined as follows: all the failure sources in $\bigcap_{w_j \in F_i} w_j$, for $i = 1, \ldots, L$, are definitely faulty.

Lemma 2: Let us assume that we repaired definitely failed components $X \subseteq S$, and there exists a $w_j \in G$ such that $|w_j - X| = 1$ and $\{s_k\} = w_j - X$. Then $s_k$ is good and should be added to the good set $G$.

Lemma 3: If we repair a definitely failed component $w_i = \{s_i\}$, and there exists a $w_j$ such that $s_i \in w_j$, then $w_j$ should be added to the good subset $G$.
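Lemma 1 amounts to intersecting the component sets of all minimal faults listed in each $F_i$. The following sketch illustrates this on the minimal faults of Example 4; the dictionary `W`, the empty set used for the fault-free element `w0`, and the function name are assumptions introduced for illustration.

```python
# Minimal faults as read from Example 4 (w0 is the fault-free condition and
# is modeled here, by assumption, as having no faulty components).
W = {
    "w0": set(),
    "w1": {"s1"},
    "w2": {"s2"},
    "w3": {"s3"},
    "w4": {"s1", "s2"},
    "w5": {"s2", "s3"},
    "w6": {"s1", "s2", "s3"},
}


def definitely_faulty(F, W):
    """Lemma 1: union over i of the intersection of w_j for all w_j in F_i."""
    sure = set()
    for f in F:
        if not f:
            continue  # an empty F_i carries no information in this sketch
        sure |= set.intersection(*(set(W[w]) for w in f))
    return sure


print(definitely_faulty([{"w4", "w6"}], W))  # components common to w4 and w6
print(definitely_faulty([{"w5"}], W))        # cardinality-one case
```

When an $F_i$ still contains `w0`, the intersection is empty, which matches the intuition that nothing can be declared faulty while the fault-free hypothesis remains possible.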

3.5.2 Sure Strategies 

The basic idea of Sure strategies is to find one or more definitely failed components, while not making an error when other co-existing faults are present [1]. In these strategies, at each stage of diagnosis, we consider the minimal candidate set of the multiple fault suspect set corresponding to the OR node at that stage, and invoke the single fault strategy to isolate these candidates. Then, we propagate the multiple fault suspect set through the resulting diagnostic tree. We repeat these procedures for each leaf node of the tree until: (1) the intersection of minimal candidates is not empty, i.e., the corresponding failure sources are definitely faulty, or (2) no test gives further information. The former corresponds to the case when the cardinality of one or more $F_i$ in the ambiguity group is one. Note that in these strategies, we only repair definitely failed components.

Example 4: Consider the digraph model in Figure 4. The digraph model consists of failure sources $S = \{s_1, s_2, s_3\}$, AND nodes $A = \{a_1, a_2, a_3\}$ and tests $T = \{t_1, t_2, t_3\}$. $w_4 = \{s_1, s_2\}$, $w_5 = \{s_2, s_3\}$ and $w_6 = \{s_1, s_2, s_3\}$ are the minimal faults for this digraph model. The binary test matrix of the digraph model is shown in Figure 5. Figure 6 shows the multiple fault strategy for this system, where ACTION nodes represent the actions to be performed at that stage of diagnosis and $A_i$ denotes the ambiguity group corresponding to the $i$th OR node, and

$A_1 = \Theta(1; \{w_0, w_1, w_2, w_3, w_4, w_5, w_6\}; \emptyset)$;
$A_2 = \Theta(1; \{w_0, w_1, w_2, w_3, w_5\}; \{w_4, w_6\})$;
$A_3 = \Theta(1; \{w_4, w_6\}; \emptyset)$;
$A_4 = \Theta(1; \{w_0, w_1, w_2, w_3\}; \{w_4, w_5, w_6\})$;
$A_5 = \Theta(1; \{w_5\}; \{w_4, w_6\})$;
$A_6 = \Theta(1; \{w_4\}; \{w_6\})$; and
$A_7 = \Theta(1; \{w_6\}; \emptyset)$.

Note that we applied Lemma 2 to $A_5$ and $A_6$. For example, $A_5 = \Theta(1; \{w_5\}; \{w_4, w_6\})$. Thus, $w_5$ is definitely faulty, i.e., $s_2$ and $s_3$ are faulty. After repairing these failures, there is no need to apply additional tests. This is because $w_4$ belongs to $G$, and therefore, $s_1$ is good; $G = S$. One interesting point to note here is that we should not repair definitely failed components at intermediate nodes of the diagnostic strategy, because doing so may mask other faults. For example, $A_3 = \Theta(1; \{w_4, w_6\}; \emptyset)$. Using Lemma 1, $w_4 \cap w_6 = \{s_1, s_2\}$ are definitely faulty. If we repair $s_1$ and $s_2$ at this stage of diagnosis, a failure of $s_3$ will go undetected.



Figure 4: Digraph model with AND node

Figure 5: Test matrix

In addition to a set of comprehensive synthetic problems, we have applied the algorithms presented in this paper and those of [10] to several real-world systems. These include: (1) the Space Shuttle Main Propulsion System with 7271 failure sources and 1292 AND nodes [9], (2) the F18 flight control system with 148 failure sources and 78 AND nodes [3] with failure sources limited to singletons and doubletons, (3) the anticollision light control system of the Sea Hawk helicopter with 51 failure sources and 55 tests, (4) the stabilator system of the Black Hawk helicopter with 238 failure sources and 834 tests, and (5) the engine torque monitoring system used in the CH-53E helicopter with 116 failure sources and 75 tests. In the latter three cases, static and dynamic multiple fault diagnostic strategies subject to various constraints on available resources, setup operations, and initial failure symptoms have been implemented, along with interfaces to interactive electronic technical manuals and multi-media documentation.

4 Conclusion 

In this paper, we presented an algorithm to find all minimal faults in a digraph model and to generate their failure signatures. Further, we extended the multiple fault sequential testing strategies of our previous work [1] to redundant systems. Computational results indicate that these strategies can be used on systems with as many as 600 failure sources and 600 tests. Furthermore, using Sure strategies, a test strategy can be generated "on-line" to diagnose multiple faults in larger systems. That is, instead of generating the entire diagnostic tree, the interactive test generation program only suggests the next test to be applied given the outcomes of previously applied tests, and generates the path leading to the isolation of multiple failures in a system.
Abstract

In this paper, we consider the problem of constructing optimal and near-optimal test sequencing algorithms for multiple fault diagnosis. The computational complexity of solving the optimal multiple-fault isolation problem is super-exponential, that is, it is much more difficult than the single-fault isolation problem, which, by itself, is NP-hard¹ [7]. By employing concepts from information theory and AND/OR graph search, we present several test sequencing algorithms for the multiple fault isolation problem. These algorithms provide a trade-off between the degree of suboptimality and computational complexity. Furthermore, we present novel diagnostic strategies that generate a diagnostic directed graph (digraph), instead of a diagnostic tree, for multiple fault diagnosis. Using this approach, the storage complexity of the overall diagnostic strategy reduces substantially. The algorithms developed herein have been successfully applied to several real-world systems. Computational results indicate that the size of a multiple fault strategy is strictly related to the structure of the system.

1 This means that the computational requirements of an optimal algorithm cannot be bounded by a polynomial function of the number of failure sources and/or the number of tests.
1 Introduction

The complexity associated with the maintenance of large integrated systems, such as the space shuttle or a modern aircraft consisting of mechanical, electro-mechanical and hydraulic subsystems, presents formidable challenges to manufacturers and end users. This is due to the large number of failure sources and the need to quickly isolate and rectify such failures with minimal down time. In addition, for redundant (fault-tolerant) systems and for systems with little or no opportunity for repair or maintenance during their operation (e.g., Hubble telescope, space station), the assumption of at most a single failure in the system between consecutive maintenance actions is unrealistic. Thus, the efficient maintenance of complex systems requires advanced diagnostic algorithms for multiple fault isolation.

A review of existing literature [13] showed that multiple-fault diagnosis using artificial intelligence techniques is too expensive and slow for large systems. Davis [2, 3] described a fault diagnosis system that reasons from the knowledge of structure and behavior. Failure candidate generation in this approach occurs in three basic steps: circuit simulation and discrepancy collection, potential candidate determination, and global consistency determination using constraint suspension techniques. The approach of Davis [2, 3] can be extended to diagnose multiple faults. However, this approach would require the application of constraint suspension to all possible combinations of components, and consequently, suffers from computational explosion. De Kleer and Williams [4] presented a model-based approach to fault diagnosis. By keeping track of multiple sets of consistent and inconsistent components, their algorithm generates minimal sets of faulty candidates, rather than generating all possible candidates. This approach requires the complete specification of system components, the state and observed variables associated with each component, and the functional relationships among the state variables. However, the precise information required by these models is typically not available for complex systems and is too costly to obtain. In addition, because of extensive use of functional simulation, this approach is extremely slow, and, thus, is not appropriate for fault diagnosis in large scale systems whose complexity is many orders of magnitude greater than that of the examples presented in [4]. Sheppard and Simpson [19] provided a formal analysis of the multiple failure problem in the context of the information flow model. They discussed the computational complexity of several algorithms for diagnosing multiple failures, and developed algorithms to generate multiple fault diagnoses for a given ambiguity group. However, this method does not take into account the failure probabilities of components or test costs.

In this paper, we present several multiple fault test sequencing algorithms. First, we extend the single-fault strategy of our previous work [7, 8, 9, 11] to diagnose multiple faults by successive replacement of single fault candidates. Using this strategy, we seek to isolate the potential single-fault candidates, then double-fault candidates, and so on. Since a component may be repaired/replaced before confirming that it is indeed faulty, the probability of false alarm error or RTOK (retest OK) is higher than that with multiple fault strategies that use all informative tests before repairing a component in the system. Then, we focus on developing a class of Sure strategies [14] for diagnosing multiple faults that employ all informative tests before diagnosis. The basic idea of these strategies is to find one or more definitely failed components, while not making an error when other co-existing faults are present. Using these algorithms, the storage and computational complexity of the multiple fault diagnostic strategy are reduced substantially.

The paper is organized as follows. In section 2, we formulate the test sequencing problem. Because of extensive use of single fault test sequencing algorithms in solving the multiple fault diagnosis problem, we describe single fault test sequencing algorithms in section 3. In section 4, we present the problem of diagnosing multiple failures using a single fault diagnostic strategy. In section 5, we present an extended single fault strategy to diagnose multiple failures. Near-optimal multiple fault strategies are discussed in section 6. In section 7, we summarize the results and discuss future research issues. Throughout, an example from [7] will be used to illustrate the concepts and the proposed diagnostic strategies. In addition, we apply our algorithms to several real-world examples.

2 Problem Formulation 

The multiple fault test sequencing problem, in its simplest form, is defined by the five-tuple $(S, P, T, C, B)$, where

1. $S = \{s_1, \ldots, s_m\}$ is a set of independent failure sources associated with the system;

2. $P = [p(s_1), \ldots, p(s_m)]$ is the a priori probability vector associated with the set of failure sources $S$;

3. $T = \{t_1, \ldots, t_n\}$ is a finite set of $n$ available binary outcome tests, where each test $t_j$ checks a subset of $S$;

4. $C = \{c_1, c_2, \ldots, c_n\}$ is a set of test costs measured in terms of time, manpower requirements, or other economic factors, where $c_j$ is the cost of applying test $t_j$;

5. $B = [b_{ij}]$ is a binary matrix of dimension $m \times n$ which represents the relationship between the set of failure sources $S$ and the set of tests $T$, where $b_{ij} = 1$ if test $t_j$ monitors failure source $s_i$; otherwise, $b_{ij} = 0$.

The problem is to design a testing strategy that unambiguously isolates the failure sources with minimum expected testing cost

$$J = \sum_{S_i \subseteq S} \sum_{t_j \in PT_i} p(S_i)\, c_j$$

where $PT_i$ is the set of applied tests (performed tests) in the path leading to the isolation of the set of failure sources $S_i$, and $p(S_i)$ is the probability of the set of failure sources $S_i$ (see Appendix A). The AND/OR sequential test strategy is represented in the form of a tree or a graph, where the OR nodes represent the suspect sets of failure sources, AND nodes are tests applied at various OR nodes, and the leaves are the isolated failure sources.

For notational convenience, we define the failure signature $FS_i$ as the set of all tests that monitor failure source $s_i$, i.e., $FS_i = \{t_j \mid b_{ij} = 1 \text{ for } 1 \le j \le n\}$. Furthermore, we assume that the failure signature of a multiple failure is the union of the failure signatures of the individual failures.


3 Single Fault Testing Strategies 


In a single fault strategy, it is assumed that the system is tested frequently enough that at most one component has failed. The single fault diagnosis problem, in its simplest form, is the five-tuple $(\bar{S}, P, T, C, D)$, where

• $\bar{S} = S \cup \{s_0\} = \{s_0, s_1, \ldots, s_m\}$ is a set of failure sources, where $s_0$ is a dummy failure source denoting the fault-free condition and $\cup$ denotes the union of two sets;

• $P = [p_0, p_1, \ldots, p_m]$ is the conditional probability vector associated with the set of failure sources $\bar{S}$ based on a single fault assumption, where $p_0$ is the probability of the fault-free condition, $s_0$. In Appendix A, we show that the conditional probability $p_i$ is related to the unconditional prior probabilities $\{p(s_i)\}$ via:

$$p_0 = \frac{1}{1 + \sum_{k=1}^{m} \frac{p(s_k)}{1 - p(s_k)}}, \qquad p_i = \frac{\dfrac{p(s_i)}{1 - p(s_i)}}{1 + \sum_{k=1}^{m} \frac{p(s_k)}{1 - p(s_k)}} \quad \text{for } i = 1, \ldots, m \tag{1}$$

• $T$ and $C$ are as defined in Section 2;

• $D = [d_{ij}]$ is a binary test matrix of dimension $(m + 1) \times n$, where $d_{0j} = 0$ for $1 \le j \le n$, and $d_{ij} = b_{ij}$ for $1 \le i \le m$ and $1 \le j \le n$.
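The conversion from unconditional priors to single-fault conditional probabilities in equation (1) is a normalization of the odds $p(s_k) / (1 - p(s_k))$. A minimal sketch, assuming independent failure sources and using the prior probabilities listed for Example 1.a (the function name is an assumption):

```python
def single_fault_conditionals(priors):
    """Return [p_0, p_1, ..., p_m] given unconditional priors p(s_1..s_m).

    Conditioning on "at most one fault" leaves each hypothesis with weight
    proportional to its odds p/(1-p); the fault-free case has weight 1.
    """
    odds = [p / (1.0 - p) for p in priors]
    norm = 1.0 + sum(odds)
    return [1.0 / norm] + [o / norm for o in odds]


priors = [0.014, 0.027, 0.125, 0.068, 0.146]  # p(s_1), ..., p(s_5)
cond = single_fault_conditionals(priors)
print([round(p, 3) for p in cond])
```

The resulting vector sums to one and agrees with the conditional probabilities quoted in Example 1.a to roughly two decimal places.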
The algorithms for designing optimal single-fault diagnostic strategies are based on dynamic programming (DP) [1], and AND/OR graph search procedures. The DP technique is based on a bottom-up procedure, and has storage and computational requirements of $O(3^n)$ for even the simplest test sequencing problem. The AND/OR graph search algorithms are top-down heuristic graph search procedures that employ a cost-to-go estimate to speed up the solution search process [7]. A novel feature of this approach is that the cost-to-go estimate (termed the Heuristic Evaluation Function (HEF)) is derived from Huffman coding and entropy. These information theoretic lower bounds ensure that an optimal solution is found using the AO*, HS, and CF search algorithms [9]. In addition, because of the top-down nature of the AND/OR graph search algorithms, several near-optimal search algorithms have been derived: (1) AO* algorithm, (2) limited search AO*, and (3) multi-step information heuristics. Furthermore, because of their top-down nature, these algorithms extend naturally to: (1) modular diagnosis, (2) precedence constraints, setup operations, and resources, and (3) rectification. The algorithms have been implemented in a software package, termed TEAMS (Testability Engineering And Maintenance System) [9]. For convenience, these algorithms are referred to as the TEAMS-S algorithms [11].

Example 1.a: In this example, we consider the same system as in [7]. In this system, there are five failure sources $s_1, \ldots, s_5$. The set of five tests, labeled $t_1, \ldots, t_5$, may be used to identify the unknown failure sources. The test matrix, along with the a priori probabilities of failure sources and test costs, is shown in Table 1. Based on the assumption of at most a single fault in the system, the set of failure sources is $\bar{S} = \{s_0, s_1, \ldots, s_5\}$, with the concomitant conditional probability vector $P = [0.700, 0.01, 0.020, 0.100, 0.050, 0.120]$. An optimal single fault test strategy for this example is shown in Figure 1. For this test strategy, the average test cost is $J = \sum_{s_i \in \bar{S}} \sum_{t_j \in PT_i} p_i c_j = 2.18$, where $PT_i$ is the set of applied tests (performed tests) in the path leading to the isolation of failure source $s_i \in \bar{S}$.
Failure      Tests (test cost c_j)              Fault probability
source       t_1    t_2    t_3    t_4    t_5    p(s_i)
             (1)    (1)    (1)    (1)    (1)
s_1          0      1      0      0      1      0.014
s_2          0      0      1      1      0      0.027
s_3          1      0      0      1      1      0.125
s_4          1      1      0      0      0      0.068
s_5          1      1      1      1      0      0.146

Table 1: Test Matrix, a Priori Probabilities and Test Costs for Example 1.a

Figure 1: Single-fault Test Strategy for the System of Example 1.a


The single fault assumption may not be valid in situations where the opportunity for frequent maintenance does not exist. In such cases, the single fault strategies can give a wrong diagnosis when multiple faults occur. For example, consider a system with $S = \{s_1, s_2, s_3\}$, $T = \{t_1, t_2\}$, $TS_1 = \{s_1, s_3\}$ and $TS_2 = \{s_2, s_3\}$, where the test signature $TS_j$ is the set of all failure sources detectable by test $t_j$, i.e., $TS_j = \{s_i \mid b_{ij} = 1 \text{ for } 1 \le i \le m\}$. Suppose that we perform both tests and that they both fail. Under the single-fault logic, we would conclude that $s_3$ is faulty. However, if $s_1$ and $s_2$ were both faulty, we would observe the same test results. Consequently, the single-fault strategy would make an incorrect diagnosis when $s_1$ and $s_2$ are both faulty.
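The three-source example above is small enough to check by brute force. The sketch below enumerates every single and multiple fault hypothesis that is consistent with both tests failing; the helper names are assumptions made for illustration.

```python
from itertools import combinations

SOURCES = ["s1", "s2", "s3"]
TS = {"t1": {"s1", "s3"}, "t2": {"s2", "s3"}}


def outcomes(faulty):
    """Map each test to True (fails) if it monitors any faulty source."""
    faulty = set(faulty)
    return {t: bool(sig & faulty) for t, sig in TS.items()}


observed = {"t1": True, "t2": True}  # both tests fail

# Single-fault logic: only singleton hypotheses are considered.
single = [s for s in SOURCES if outcomes({s}) == observed]

# Multiple-fault logic: every non-empty subset of sources is a hypothesis.
multi = [
    set(c)
    for r in range(1, len(SOURCES) + 1)
    for c in combinations(SOURCES, r)
    if outcomes(c) == observed
]

print(single)
print(multi)
```

Single-fault logic reports only $s_3$, while the exhaustive enumeration also admits $\{s_1, s_2\}$ (and the supersets containing $s_3$), which is exactly the ambiguity described in the text.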

In the following, we define hidden and masking false failures, which are possible multiple fault candidates at each leaf node of the single fault diagnostic tree. The set of hidden failures $HF_i$ for failure source $s_i$ is given by:

$$HF_i = \{ s_j \mid j \neq i \text{ and } (FS_i \cap PT_i) \cup (FS_j \cap PT_i) = (FS_i \cap PT_i) \}$$

In words, $HF_i$ consists of those failure sources whose failure signatures, restricted to the set of applied tests $PT_i$ in the path leading to the isolation of failure source $s_i$, are masked by the failure of $s_i$, i.e., are subsets of the failure signature of $s_i$ restricted to $PT_i$. The set of masking false failures $MS_i$ for failure source $s_i$ consists of those sets whose failure signatures corresponding to $PT_i$ add up to mask the failure of $s_i$, i.e.,

$$MS_i = \Big\{ X \,\Big|\, X \subseteq (S - s_i),\ \bigcup_{s_k \in X} (FS_k \cap PT_i) = (FS_i \cap PT_i) \Big\}$$

The multiple fault ambiguity group at a leaf node of the single-fault diagnostic strategy where failure source $s_i$ is isolated consists of masking faults $MS_i$ and any combination of masking faults $MS_i$ and hidden faults $HF_i$ with $s_i$, i.e., $MS_i \cup (MS_i \times \{s_i\}) \cup (2^{HF_i} \times \{s_i\})$, where $\times$ denotes the cross product function and $2^{HF_i}$ is the power set of $HF_i$. The problem of identifying the set of hidden failures is relatively easy to solve. In contrast, the problem of enumerating the masking false failures for each failure source $s_i$ is computationally expensive. Typically, it requires $O(|PT_i| 2^m)$ or $O(2^m)$ operations [10].
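To make the two definitions concrete, the following brute-force sketch enumerates hidden failures and masking false failures for $s_5$ in the system of Example 1.a. The failure signatures come from Table 1; the path $PT_5 = \{t_2, t_4\}$ is an assumption inferred from the discussion of the single-fault tree, and the function names are illustrative. The enumeration is exponential in $m$, in line with the complexity noted above.

```python
from itertools import combinations

# Failure signatures FS_i from Table 1.
FS = {
    "s1": {"t2", "t5"},
    "s2": {"t3", "t4"},
    "s3": {"t1", "t4", "t5"},
    "s4": {"t1", "t2"},
    "s5": {"t1", "t2", "t3", "t4"},
}


def hidden_failures(i, PT):
    """HF_i: sources whose signature restricted to PT is covered by that of s_i."""
    target = FS[i] & PT
    return {j for j in FS if j != i and (FS[j] & PT) <= target}


def masking_false_failures(i, PT):
    """MS_i: subsets of S - {s_i} whose combined signature on PT equals that of s_i."""
    target = FS[i] & PT
    others = sorted(s for s in FS if s != i)
    result = []
    for r in range(1, len(others) + 1):
        for X in combinations(others, r):
            covered = set().union(*(FS[k] & PT for k in X))
            if covered == target:
                result.append(frozenset(X))
    return result


PT5 = {"t2", "t4"}  # assumed path of applied tests leading to s5
hf5 = hidden_failures("s5", PT5)
ms5 = masking_false_failures("s5", PT5)
print(sorted(hf5))
print(len(ms5), 2 * len(ms5))
```

Under these assumptions the masking set has nine members, so $|MS_5| + |MS_5 \times \{s_5\}| = 18$, matching the count quoted in Section 4.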



4 Multiple-Fault Isolation Using Single-Fault Strategy 


One often stated premise is that one can apply the single fault strategy repeatedly until all the faults are isolated. This strategy works well when there are no masking false failures at the leaf nodes of the single-fault diagnostic tree. However, if the set of masking false failures at a leaf node is not empty, the single fault strategy will give a wrong diagnosis. To illustrate this case, let us assume that $s_1$ and $s_3$, in Example 1.a, are faulty. Based on the single fault diagnostic tree, $t_2 = f$ and $t_4 = f$ (both tests fail), and we would assume that $s_5$ is faulty. After repairing/replacing $s_5$, we would perform more tests from the root OR node and again obtain $t_2 = f$ and $t_4 = f$, i.e., the same test results as before. This is because $\{s_1, s_3\} \in MS_5 = \{\{s_1, s_2\}, \{s_1, s_3\}, \{s_2, s_4\}, \{s_3, s_4\}, \{s_1, s_2, s_3\}, \{s_2, s_3, s_4\}, \{s_1, s_2, s_4\}, \{s_1, s_3, s_4\}, \{s_1, s_2, s_3, s_4\}\}$. In this example, 18 failures out of 32 ($= 2^5$) multiple failures cannot be isolated by repeatedly using the single fault diagnostic tree. This is because $|MS_5| + |MS_5 \times \{s_5\}| = 18$. The occurrence of masking false failure sets is fairly common. To illustrate this, we generated 10 random systems with five components, five tests, $P = [0.5, 0.5, 0.5, 0.5, 0.5]$, and $C = \{1, 1, 1, 1, 1\}$. Only 2 systems did not have masking sets, and the average size of masking sets based on all systems was 6. Therefore, on the average, 12 multiple failures out of 32 failures cannot be isolated in these systems via repetitive application of single fault logic.

In addition to this set of synthetic problems, we have considered several real-world systems. These include:

1. Anticollision Light Control System of the Sea Hawk helicopter with 43 failure sources and 53 tests,

2. An amplifier-filter with 80 failure sources and 25 tests,

3. 1553 Data Bus with 176 faults and 53 test points,

4. A circuit board model (courtesy of Goodrich Aerospace) generated from an EDIF (Electronic Design Interchange Format)² netlist containing 898 faults and 250 tests,

5. Phase Decoder model (public domain test circuit for EDIF parsers) with 1644 faults and 2147 tests.

System                   m       n       % Leaves with Masking False Failures
Anticollision system     43      53      47.50%
Amplifier Filter         80      25      17.86%
1553 Bus                 176     53      20.59%
Goodrich (EDIF)          898     250     0%
Phase Decoder (EDIF)     1644    2147    3.02%

Table 2: Percentage of Leaf Nodes with Masking False Failures

Table 2 shows the percentage of leaf nodes which contain at least one masking false failure. In conclusion, a single fault diagnostic tree can be used to isolate multiple failures in systems with no masking sets. However, as the above results show, the masking sets in most systems are not empty. Consequently, practical multiple fault diagnosis algorithms are needed.

5 Multiple Fault Diagnosis Using an Extended Single Fault Testing Strategy 

In this approach, we invoke a single fault strategy, and repair/replace the identified component at each leaf node, if any. Then, we check whether the repaired/replaced component at each leaf node is definitely faulty or not. If for any test $t_j$ that failed previously, the cardinality of $TS_j - G$ is one, i.e., $TS_j - G$ contains only one failure source, then the corresponding failure source is definitely faulty, where $G$ is the union of test signatures of previously passed tests. If the repaired/replaced component is definitely faulty, we apply additional tests, if necessary, to isolate the remaining faults. Additional tests can be applied from either the root OR node, or from the first failed test in the path leading to the identification of previous faults. This process ensures that we do not come back to the same leaf node twice.

2 http://www.cs.man.ac.uk/cad/EDIFTechnical Centre.

Alternatively, if the replaced module is not definitely faulty, there exist other sets of components which have the same failure signature as the failure signature of the replaced module, i.e., masking false failures. In this case, if we start from the root OR node or the first failed test in the path, we may reach the same leaf node. In order to solve this problem, we remove the replaced modules from the ambiguity group at the current stage of diagnosis, and invoke the single fault strategy TEAMS-S to isolate the remaining suspected components. Then, we repair/replace the identified modules at each leaf node. If the repaired/replaced module at a leaf node of this tree is definitely faulty, we apply additional tests from the root OR node or from the first failed test after the last repair. On the other hand, if the identified module at a leaf node is not definitely faulty, we update the ambiguity group and invoke the single fault strategy as before. This procedure is continued until no test gives further information or the system is fault-free. The extended single fault algorithm is formalized in the next subsection.
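The "definitely faulty" check used by this procedure, $|TS_j - G| = 1$ for some previously failed test $t_j$, can be sketched as follows. The test signatures are those implied by Table 1; the function name and the two scenarios are assumptions chosen for illustration, and $G$ is taken here to be only the union of the signatures of passed tests, as in the prose description.

```python
# Test signatures TS_j implied by the test matrix of Table 1.
TS = {
    "t1": {"s3", "s4", "s5"},
    "t2": {"s1", "s4", "s5"},
    "t3": {"s2", "s5"},
    "t4": {"s2", "s3", "s5"},
    "t5": {"s1", "s3"},
}


def definitely_faulty(failed_tests, passed_tests):
    """Return the sources that are the sole remaining suspect of a failed test."""
    good = set().union(*(TS[t] for t in passed_tests))
    sure = set()
    for t in failed_tests:
        remaining = TS[t] - good
        if len(remaining) == 1:
            sure |= remaining
    return sure


# Only t2 and t4 applied, both failed: s5 is isolated by single-fault logic,
# but no source can be confirmed as faulty.
print(definitely_faulty(["t2", "t4"], []))

# If t1 is also applied and passes, s3, s4 and s5 are cleared.
print(definitely_faulty(["t2", "t4"], ["t1"]))
```

In the first scenario the check returns an empty set, which is why replacing $s_5$ at that leaf carries a risk of a false alarm; in the second, the passed test leaves $s_1$ and $s_2$ as the only explanations of the two failures.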

Example 1.b: In this example, we consider the same system as in Example 1.a. The extended single fault diagnostic strategy for this example is shown in Figure 2, where the ACTION nodes represent the actions to be performed at the corresponding OR node. Note that the shaded parts of the tree are the same as those in the single fault diagnostic tree of Figure 1. The average testing cost for this case is $J = 2.780$. The joint probability that $s_5$ is good and is repaired/replaced is 0.0103.

Figure 2: Extended Single-fault Strategy to Diagnose Multi-faults in Example 1.a

We applied the extended single fault strategy to several real-world systems. Table 3 shows the times taken to construct an extended single fault diagnostic strategy for several real-world systems.

                                         Time (sec), by # Repair Limits
System                   (m, n)          1         2         3         All
Anticollision system     (43, 53)        0.27      2.41      2.93      6.93
Amplifier-filter         (80, 25)        0.18      0.63      0.88      0.90
1553 Bus                 (176, 53)       0.27      1.81      1.81      1.81
Goodrich (EDIF)          (898, 250)      0.88      0.88      0.88      0.88
Phase Decoder (EDIF)     (1644, 2147)    41.59     370.00    691.94    1132.55

Table 3: Solution Times in Seconds Based on Extended Single Fault Strategy for Various Real-world Systems on a SPARC-10

Table 4 shows the number of nodes in the extended single fault diagnostic strategies for these real-world systems.³

One drawback of the extended single-fault strategy is that the probability of repairing/replacing a good component, i.e., false alarm error or RTOK (retest OK), is higher than that with multiple fault strategies that employ all informative tests before repairing a component in the system (see section 6.2). Furthermore, in the case of very large systems, it is practical to solve multiple fault isolation problems up to a certain cardinality $L \ge 1$, e.g., single or double failures. This is based on the premise that multiple faults of large cardinality are much less likely to occur. However, in an extended single fault strategy, if we stop expanding the diagnostic tree after a limited number of repair actions, say $L$, it does not mean that we can diagnose multiple faults up to size $L$ using the same tree. This is because a component may be repaired/replaced before confirming that it is indeed faulty.

3 In order to reduce the search space, the TEAMS-S algorithms preprocess the binary D-matrix as follows: (1) they collapse all the failure sources with the same failure signature to create a new representative failure source, and (2) they eliminate the redundant tests [10] (see section 6.3).
                                         # Nodes, by # Repair Limits
System                   (m, n)          1         2         3         All
Anticollision system     (43, 53)        79        897       1065      2371
Amplifier-filter         (80, 25)        55        231       347       347
1553 Bus                 (176, 53)       67        417       417       417
Goodrich (EDIF)          (898, 250)      37        37        37        37
Phase Decoder (EDIF)     (1644, 2147)    993       8919      16389     25663

Table 4: Number of Nodes in Extended Single Fault Strategy for Various Real-world Systems

5.1 Extended Single-fault Algorithm

The extended single fault algorithm is a recursive function, and must be invoked as Extended-Single-Fault(OR node, SS), where

• OR node is the current OR node,

• SS denotes the suspected faults at the current OR node.

Global variables:

• the root OR node of the diagnostic tree,

• the set of failure sources $S = \{s_1, \ldots, s_m\}$,

• the a priori probability vector $P = [p(s_1), \ldots, p(s_m)]$,

• the set of available tests $T = \{t_1, t_2, \ldots, t_n\}$,

• the set of test costs $C = \{c_1, c_2, \ldots, c_n\}$,

• the binary test matrix $B = [b_{ij}]$.

Initialization:

• OR node = root OR node,

• $SS = \bar{S} = S \cup \{s_0\}$.
Algorithm: Extended-Single-Fault(OR node, SS)

step 1: Evaluate the conditional probability vector P_s of the faults in SS using P.
step 2: Expand the diagnostic tree from the OR node by invoking
        TEAMS-S(SS, P_s, T, C, D_s), where D_s contains the failure
        signatures of the failures in SS.
step 3: DO for each UNSOLVED leaf node,
    step 3.1: Action: repair/replace the identified component, if any.
    step 3.2: G ← {repaired/replaced failure sources} ∪ (∪ TS_i over passed tests t_i)
              for t_i in the path from OR node to the leaf node.
    step 3.3: SS ← SS − G.
    step 3.4: IF for any failed test t_j in the path from OR node to
              the leaf node, |TS_j − G| = 1 THEN
                  IF SS = {s_0},
                      Action: stop.
                  ELSE
                      Action: Apply additional tests from the
                              root OR node or the first failed test
                              after the OR node.
                  END
              ELSE
                  - SS ← SS ∪ {s_0}.
                  - Extended-Single-Fault(leaf node, SS).
              END
        END


6 Multiple Fault Testing Strategies 

One approach that employs all informative tests before repairing/replacing a component is to consider all possible combinations of failure sources, i.e., $2^S$, and generate an optimal multiple fault diagnostic strategy using the single-fault test sequencing algorithm TEAMS-S. However, the storage and computational complexity of the optimal multiple-fault isolation problem is super-exponential in $m$. In order to reduce storage complexity, we use a compact set notation [6], and in order to reduce the computational complexity, we present a class of Sure diagnosis strategies for multiple fault isolation.

6.1 Compact Set Notation 

Following Grunberg et al. [6], we use the compact notation $A = \Theta(L; F_1, \ldots, F_L; G)$ to denote the multiple fault ambiguity group at each OR node. The $F_i$ for $i = 1, \ldots, L$ and $G$ are subsets of $\bar{S} = \{s_0, s_1, \ldots, s_m\}$; $G$ is the set of known good failure sources (failure free sources), and $F_i$ for $i = 1, \ldots, L$ are sets that are known to contain at least one definitely failed failure source each, i.e.,

$$\Theta(L; F_1, F_2, \ldots, F_L; G) = \{ X \subseteq \bar{S} \mid X \cap F_i \neq \emptyset \text{ for } i = 1, \ldots, L, \text{ and } X \cap G = \emptyset \} \tag{2}$$

where $\cap$ denotes the intersection of two sets. In the following, we summarize some of the properties of compact set notation [14, 15, 16]:

1. Multiple fault logic using the compact set notation is as follows: the initial hypothesis set is 
the set of all subsets of S, i.e., A= 0(1; Fi = S ; G = 0). After performing a test, say tj , the 
hypothesis set A = 0(Z; Fi,..., Fl', G) is decomposed as follows: 

@(L; (Fi A TSj ), ..., (F L A TSj)- ( G V TSj )) if tj passes 

A <— 

Q(L + 1; f\, ..., F l , TSj A G C ;G) if tj fails 

where superscript c denotes the set complement, i.e., G c = S — G. 

2. If $E \supseteq F_i$ for some $i$ (that is, $E$ is a superset of $F_i$), then
$\Theta(L+1; F_1, \ldots, F_L, E; G) = \Theta(L; F_1, \ldots, F_L; G)$ [6].

3. $A = \Theta(L; F_1, \ldots, F_L; G) = \Theta(L; F_1 \cap G^c, \ldots, F_L \cap G^c; G)$ [6].

4. Given a set of previously applied passed tests $T_p \subseteq T$ and failed tests $T_f \subseteq T$, the multiple
fault ambiguity group at the current stage of diagnosis can be generated directly as
$\Theta(L; F_1, \ldots, F_L; G)$, where $G = \bigcup_{t_i \in T_p} TS_i$, $L = |T_f| + 1$, $F_1 = S$ (see the first property), and
$F_{i+1} = TS_j \cap G^c$ for $i = 1, \ldots, |T_f|$ and $t_j \in T_f$; property 2 is then employed to remove
supersets from the set $F = \{F_1, \ldots, F_L\}$.

5. If $|T_f| = 0$, then $L = 1$ and $s_0 \in F_1$. If $|T_f| > 0$, none of the $F_i$'s contains $s_0$ (see the first
property).
6. The worst case storage complexity of compact set notation for an OR node is $O(mn)$. This
is because the ambiguity group $\Theta(L; F_1, F_2, \ldots, F_L; G)$ contains all solutions of the following
constraint equations:

$$W y \ge e$$

$$y_i = 0 \text{ if } g_i = 1, \text{ for } i = 0, \ldots, m$$

where $y = [y_0, y_1, \ldots, y_m]'$ is a binary vector; $e$ is the $L$-dimensional vector of 1's; $W = [w_{ij}]$
is a binary matrix of dimension $L \times (m + 1)$, with $w_{ij} = 1$ if $s_j \in F_i$ and $w_{ij} = 0$ otherwise; and
$g = [g_0, g_1, \ldots, g_m]'$ is a binary vector such that $g_j = 1$ if $s_j \in G$ and $g_j = 0$ otherwise. Using
this notation, we need to store the binary matrix $W$ and binary vector $g$ at each stage of
diagnosis. Therefore, the storage complexity of this approach is $O(mn)$ at each OR node,
since $L \le n$ and each test is applied at most once in each path of the diagnostic tree.

7. The failure sources belonging to an $F_i$ with cardinality $|F_i| = 1$ are definitely faulty (one-for-sure
condition). This can easily be shown using equation (2).
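
The update rules in properties 1, 2 and 7 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the class name `CompactGroup` and the helper names are hypothetical, failure sources are represented by strings, and a test signature $TS_j$ is passed in as a plain set. Perfect (reliable) tests are assumed, as in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactGroup:
    """Ambiguity group Theta(L; F_1..F_L; G); L is len(fs)."""
    fs: tuple        # tuple of frozensets, the F_i
    good: frozenset  # G, the known-good failure sources


def drop_supersets(sets):
    """Property 2: an F_i that strictly contains another F_j is redundant."""
    unique = set(sets)
    kept = (f for f in unique if not any(g < f for g in unique))
    return tuple(sorted(kept, key=sorted))  # sorted only for a stable order


def initial_group(sources):
    """Initial hypothesis set Theta(1; F_1 = S; G = empty)."""
    return CompactGroup((frozenset(sources),), frozenset())


def apply_test(group, signature, passed):
    """Property 1: update the group after test t_j with signature TS_j."""
    ts = frozenset(signature)
    if passed:
        good = group.good | ts              # G <- G union TS_j
        fs = [f - ts for f in group.fs]     # F_i <- F_i minus TS_j
    else:
        good = group.good
        fs = list(group.fs) + [ts - group.good]  # new set TS_j intersect G^c
    return CompactGroup(drop_supersets(fs), good)


def sure_faults(group):
    """Property 7: any F_i with a single member is definitely faulty."""
    return {next(iter(f)) for f in group.fs if len(f) == 1}


group = initial_group(["s0", "s1", "s2", "s3"])
group = apply_test(group, {"s1", "s2"}, passed=False)
group = apply_test(group, {"s2", "s3"}, passed=True)
print(sure_faults(group))
```

In this toy run the failed test leaves $F_1 = \{s_1, s_2\}$ (the initial set $S$ is dropped as a superset, in line with property 5), and the passed test then removes $s_2$, so $s_1$ satisfies the one-for-sure condition.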

6.2 Sure Strategies for Multiple Fault Diagnosis 

In this section, we present three diagnostic strategies, Sure 1-3, that seek to find definitely 
failed components, even though there may be others still undiagnosed. Thus, these strategies 
isolate failures one (or more) at a time, while not making an error when multiple faults are present. 
The framework for Sure strategies is sketched in Figure 3. 

Figure 3: Framework of Sure Strategies in a Test-and-repair Cycle

The three basic ingredients of Sure 1-3 are: (i) minimal candidate generation, (ii) minimal
candidate isolation, and (iii) multiple fault propagation. The minimality property implies that a
particular candidate includes the minimum number of failure sources that explains all test results
observed so far (if any). Consequently, the inherent combinatorial explosion that occurs in generating
an optimal multiple fault strategy is reduced substantially. Before describing the algorithms,
we define the minimal (irreducible) set and the hitting set of a set of subsets:

Definition 1: A minimal or irreducible set for a collection of subsets $Q = \{Q_1, \ldots, Q_k\}$ is a set
$I(Q) \subseteq Q$ such that $I(Q) = Q - \{Q_i \mid \exists\, Q_j \in Q \text{ and } Q_j \subset Q_i\}$, i.e., $I(Q)$ is equal to the set $Q$ without
any superset.

Definition 2: A hitting set for a collection of sets $Q = \{Q_1, \ldots, Q_k\}$ is a set $H(Q) = \{H_1, \ldots, H_q\}$
such that $H_j \subseteq \bigcup_{1 \le i \le k} Q_i$ for $j = 1, \ldots, q$, and $H_j \cap Q_i \neq \emptyset$ for $i = 1, \ldots, k$.

Based on these definitions, it can be shown that [12]:

Lemma 1: The minimal set of a multiple fault ambiguity group $A = \Theta(L; F_1, \ldots, F_L; G)$ is the
minimal hitting set for the collection of sets $F = \{F_1, \ldots, F_L\}$, i.e., $I(A) = I(H(F))$.
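
To make Definitions 1 and 2 and Lemma 1 concrete, the following sketch enumerates minimal hitting sets by brute force in order of increasing size. It is only an illustration under stated assumptions: it is not the algorithm of Reiter [12] or its correction [5], the function names are hypothetical, and the exhaustive search is practical only for small collections. The optional `max_size` argument is an assumed stand-in for limiting candidates to a given number of simultaneous faults.

```python
from itertools import combinations


def minimal_sets(collection):
    """Definition 1: drop every member that strictly contains another member."""
    sets = {frozenset(q) for q in collection}
    return {q for q in sets if not any(other < q for other in sets)}


def minimal_hitting_sets(collection, max_size=None):
    """Definition 2 plus minimality: smallest sets that intersect every Q_i."""
    sets = [frozenset(q) for q in collection]
    universe = sorted(set().union(*sets))
    limit = len(universe) if max_size is None else max_size
    found = []
    for size in range(1, limit + 1):
        for combo in combinations(universe, size):
            candidate = frozenset(combo)
            if any(h <= candidate for h in found):
                continue  # contains a smaller hitting set, so not minimal
            if all(candidate & q for q in sets):
                found.append(candidate)
    return set(found)


# F_1 = {s1, s4, s5} and F_2 = {s2, s3, s5}: two failed tests, nothing known good.
F = [{"s1", "s4", "s5"}, {"s2", "s3", "s5"}]
for candidate in sorted(minimal_hitting_sets(F), key=lambda c: (len(c), sorted(c))):
    print(sorted(candidate))
```

For this collection the single fault $\{s_5\}$ and the four double faults that pair one of $\{s_1, s_4\}$ with one of $\{s_2, s_3\}$ are the minimal candidates.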

The Sure1-Sure3 algorithms are recursive procedures. At each iteration, we consider the minimal
candidate set of the multiple fault suspect set corresponding to the OR node at that stage.
Reiter [12] has derived an algorithm to determine the minimal hitting set of a collection of sets, and
Greiner et al. [5] have presented a correction to Reiter's algorithm. We use this technique
to determine the minimal hitting set of $F = \{F_1, \ldots, F_L\}$ at an OR node. After determining the
minimal candidates of a multiple fault suspect set at the current stage, we evaluate the conditional
probabilities of the minimal candidates using Bayes' rule. Then, we invoke the single fault strategy
TEAMS-S to isolate these candidates, and propagate the multiple fault suspect set through the
resulting diagnostic tree. Note that, using the fourth property of compact set notation, it is sufficient
to generate and store the multiple fault ambiguity group at the leaf nodes of this tree only. We repeat
these procedures for each leaf node of the tree until: (1) the intersection of minimal candidates
is not empty, i.e., the corresponding failure sources are definitely faulty, or (2) no test provides
further information. The former corresponds to the case when the cardinality of one or more $F_i$ in
the ambiguity group is one.

After repairing/replacing the components isolated by Sure strategies, we apply additional tests, 
if necessary, to isolate the remaining failure sources. We explore three different approaches for the 
application of additional tests: (1) start from the root OR node of the diagnostic tree; (2) start from 
the first failed test in the path leading to the isolation of previous faults; (3) update the multiple 
fault suspect set at the leaf node by integrating previous test results using the fourth property of 
the compact set notation, removing repaired/replaced failure sources from the ambiguity group at 
the leaf node, and invoking Sure strategies for the updated ambiguity group. Sure 1-3 algorithms 
correspond to the first, second and third approaches for applying additional tests, respectively. 
These are presented in detail in the next subsection. 

The Sure1 diagnostic strategy is simple and the resulting diagnostic tree is very similar to the
single fault diagnostic tree. However, the expected testing cost using this strategy is usually high.
The expected testing cost using the Sure2 diagnostic strategy is lower than that of Sure1, but the next
test to be performed after repairing/replacing each failure source will be different. Furthermore,
the diagnostic tree will change to a digraph (directed graph). The expected testing cost for the
third approach is the smallest, but the size of the diagnostic tree will be considerably larger than
the others. This is because the number of leaves of the diagnostic tree is the same as the number of
distinguishable multiple-fault failure signatures. For example, in the worst case, i.e., when the test
matrix $B$ is diagonal, the number of leaves is $2^m$, because there are $2^m$ possible multiple-fault
failure signatures. However, the number of leaf nodes in the Sure1 and Sure2 diagnostic strategies in
this case is the same as in a single-fault strategy, i.e., $m + 1$.

One of the interesting features of Sure strategies is that the starting point for all three algorithms
is the same tree as in a single fault strategy for the system under consideration. This is because the
minimal candidate set for $2^S$ is $\{s_0, s_1, \ldots, s_m\}$. Therefore, these strategies isolate a single fault with
the smallest average cost, while not making an error when multiple faults are present. Furthermore,
in the case of very large systems, instead of generating all minimal candidates, we can generate
minimal candidates of size less than a certain threshold, $L$, and diagnose multiple faults up to that
size.

Example 1.c: Figure 4, without (with) the dashed lines, shows the multiple fault strategy
for the system in Example 1.a, based on the Sure1 (Sure2) algorithm, where $A_i$ denotes the
ambiguity group corresponding to the OR node i, and A\= 0 ( 1 ; {so,S4,S2, -53,54, S5} ; 0 ); A2— 
0(1) {So, S2, ’S3} , {si , 54, 55} ), A3 — 0(1, {si, S4, S5} ,0), A4 — 0(1, {so} ,{si, $2 , S3, S4, 55}), A§ — 0(1, 
{S2)S3} , {54,54,55}), Aq — 0 ( 1 , {54,64}, {52, 53, 55}), Ay — 0 ( 2 , {54,54,55}, {^2, 53, 55}, 0 ), Ag 
0 ( 1 ) {S2}) {S4 5 S3, 54, 55}), A9 — 0 ( 1 , {53}, {54,54,55}), A40 = 0 ( 1 , {S4}, {52,53, S45S5}), A44 
— 0 ( 1 , {54}, {52,53,55}), A42 — 0 ( 2 , {54, S5}, {52, 55}, {54, 53}), A43 — 0 ( 3 , {S4,54,Ss}, {52,53, S5}, 
{si.,53}, 0 ), A44 — 0 ( 2 , {53}, {54, 54}, {52,55}), A45 — 0 ( 3 , {54,54,55}, {54 , 53}, {52, 55}, 0 ), A46 
= 0 ( 2 ; {54}, {52}; {53,54,55}); A17 = 0 ( 4 ; {54,53}, { 5 2 , 5 5 }, {53,54,55}, { 54 , 5 4 , 5 5 }; 0 ) 

Note that the shaded parts of the tree are the same as those in the single fault diagnostic tree
of Figure 1. The average testing cost for the optimal multiple fault strategy is $J = 2.411$, and
the average testing costs for the first (Sure1) and second (Sure2) approaches using the diagnostic
strategy of Figure 4 are $J = 2.715$ and $J = 2.616$, respectively.

Figure 4: Sure1 and Sure2 Test Strategies for Example 1.a

Example 1.d: The Sure3 strategy for Example 1.a is shown in Figure 5, where
$A_{18} = A_{20} = A_{24} = \Theta(1; \{s_0\}; \{s_1, s_2, s_3, s_4, s_5\})$; $A_{19} = \Theta(1; \{s_2\}; \{s_1, s_3, s_4, s_5\})$;
$A_{23} = \Theta(1; \{s_4\}; \{s_2, s_3, s_5\})$; $A_{21} = A_{22} = A_{25} = \Theta(1; \{s_1\}; \{s_2, s_3, s_4, s_5\})$.

Note that the shaded and dashed parts of the tree in Figure 5 are the same as those in Figure
4. For this test strategy, the average test cost is $J = 2.535$. In this example, we considered a block
replacement strategy when no test gives further information; see, for example, ambiguity groups
$A_{12}$ and $A_{17}$.

We applied the Sure algorithms to several real-world systems. Table 5 shows the times taken to
construct diagnostic strategies based on Sure1 and Sure2, and Table 6 shows the number of nodes
in the resulting strategies. Table 7 shows the times taken to construct a diagnostic strategy based
on Sure3 for the same systems.
| System | (m, n) | Time (sec), fault limit 1 | Fault limit 2 | Fault limit 3 | All |
|---|---|---|---|---|---|
| Anticollision system | (43, 53) | 0.27 | 1.71 | 5.98 | 26.28 |
| Amplifier-filter | (80, 25) | 0.18 | 0.23 | 0.26 | 0.27 |
| 1553 Bus | (176, 53) | 0.27 | 0.50 | 0.82 | 1.05 |
| Goodrich (EDIF) | (898, 250) | 0.88 | 0.88 | 0.88 | 0.88 |
| Phase Decoder (EDIF) | (1644, 2147) | 41.59 | 461.24 | 1194.16 | (>2400) |

Table 5: Solution Times in Seconds Based on Sure1 and Sure2 Strategies for Various Real-world
Systems on a SPARC-10


| System | (m, n) | # Nodes, fault limit 1 | Fault limit 2 | Fault limit 3 | All |
|---|---|---|---|---|---|
| Anticollision system | (43, 53) | 79 | 521 | 1889 | 7257 |
| Amplifier-filter | (80, 25) | 55 | 75 | 83 | 89 |
| 1553 Bus | (176, 53) | 67 | 123 | 225 | 289 |
| Goodrich (EDIF) | (898, 250) | 37 | 37 | 37 | 37 |
| Phase Decoder (EDIF) | (1644, 2147) | 993 | 8843 | 21347 | out of memory (>79000) |

Table 6: Number of Nodes in Sure1 and Sure2 Strategies for Various Real-world Systems
Figure 5: Sure3 Test Strategy for Example 1.a

Table 8 shows the number of nodes in the Sure3 diagnostic strategy of these real-world systems. 

The computational results indicate that the size of the diagnostic strategy based on Sure3 is 
considerably larger than the others, and consequently, Sure3 diagnostic strategy cannot be applied 
to large-scale systems. 


6.2.1 Sure Algorithms 

Sure algorithms are recursive functions, and must be invoked as Sure(OR node, $A_m$, $sure_i$), where

• OR node is the current OR node,

• $A_m = \Theta(L; F_1, F_2, \ldots, F_L; G)$ is the multiple fault ambiguity group at the OR node,

• $sure_i$ denotes one of the Sure1-Sure3 diagnostic strategies.

Global variables: 
| System | (m, n) | Time (sec), fault limit 1 | Fault limit 2 | Fault limit 3 | All |
|---|---|---|---|---|---|
| Anticollision system | (43, 53) | 0.27 | 6.09 | 89.74 | >7200 |
| Amplifier-filter | (80, 25) | 0.18 | 2.16 | 6.05 | 803.27 |
| 1553 Bus | (176, 53) | 0.27 | 3.76 | 16.72 | >5000 |
| Goodrich (EDIF) | (898, 250) | 0.88 | 4.92 | 7.72 | 19.69 |
| Phase Decoder (EDIF) | (1644, 2147) | 41.59 | >3600 | - | - |

Table 7: Solution Times in Seconds Based on Sure3 Strategy for Various Real-world Systems


| System | (m, n) | # Nodes, fault limit 1 | Fault limit 2 | Fault limit 3 | All |
|---|---|---|---|---|---|
| Anticollision system | (43, 53) | 79 | 1195 | 13275 | >100000 |
| Amplifier-filter | (80, 25) | 55 | 601 | 1619 | 24204 |
| 1553 Bus | (176, 53) | 67 | 773 | 3433 | >100000 |
| Goodrich (EDIF) | (898, 250) | 37 | 343 | 553 | 1463 |
| Phase Decoder (EDIF) | (1644, 2147) | 993 | >100000 | - | - |

Table 8: Number of Nodes in Sure3 Strategy for Various Real-world Systems
• the set of failure sources $S = \{s_1, \ldots, s_m\}$,

• the a priori probability vector $P = [p(s_1), \ldots, p(s_m)]$,

• the set of available tests $T = \{t_1, \ldots, t_n\}$,

• the set of test costs $C = \{c_1, c_2, \ldots, c_n\}$,

• the binary test matrix $B = [b_{ij}]$.

Initialization:

• OR node = root OR node,

• $A_m = \Theta(1; F_1 = S; G = \emptyset)$.

Algorithm: Sure(OR node, $A_m$, $sure_i$)

step 1: Generate the minimal (or irreducible) set of the multiple fault
        ambiguity group $A_m$: $A_s = I(A_m)$.

step 2: Evaluate the conditional probabilities $P_s$ of the faults in $A_s$ using $P$.

step 3: Generate the binary test matrix $D_s$ using $B$ for the faults in $A_s$;
        the failure signature of each fault in $A_s$ is the union of the failure
        signatures of individual failures.

step 4: IF no test gives any information, THEN

    step 4.1: Action: repair/replace all faults in $\bigcup_{1 \le i \le L} F_i - \{s_0\}$.
    step 4.2: $G \leftarrow G \cup$ {repaired/replaced failure sources}.
    step 4.3: $SS \leftarrow S - G$.
    step 4.4: IF $SS = \{s_0\}$, THEN

        - Action: stop.

        - label the OR node SOLVED, and RETURN.
      END

    step 4.5: IF $sure_i$ is Sure1, THEN

        - Action: apply more tests from the root OR node.

        - label the OR node SOLVED, and RETURN.
      ELSE IF $sure_i$ is Sure2, THEN

        - Action: apply more tests from the first failed
          test on the path from the root OR node to the
          OR node.

        - label the OR node SOLVED, and RETURN.
      ELSE IF $sure_i$ is Sure3, THEN

        - $A_m = \Theta(1; F_1 = SS; G)$.

        - Invoke Sure(OR node, $A_m$, $sure_i$).

      END

  END

step 5: Expand the diagnostic tree from the OR node by invoking
        TEAMS-S($A_s$, $P_s$, $T$, $C$, $D_s$).

step 6: Propagate the multiple fault ambiguity group $A_m$ of
        the OR node along the tree.

step 7: DO for each UNSOLVED leaf node,

    step 7.1: IF the multiple fault ambiguity group of the leaf
              node has guaranteed failures identified (i.e., one or
              more $F_i$ with one member),

      THEN

        step 7.1.1: IF any $F_i = \{s_0\}$, THEN

            - Action: stop.

            - label the OR node SOLVED,
              and CONTINUE.

          END

        step 7.1.2: Action: repair/replace the faults
                    in the $F_i$(s) with one member.
        step 7.1.3: $G \leftarrow G \cup$ {repaired failure sources}.
        step 7.1.4: $SS \leftarrow S - G$.
        step 7.1.5: IF $SS = \{s_0\}$, THEN

            - Action: stop.

            - label the OR node SOLVED,
              and CONTINUE.

          END

        step 7.1.6: IF $sure_i$ is Sure1, THEN

            - Action: apply more tests from the
              root OR node.

            - label the OR node SOLVED,
              and CONTINUE.

          ELSE IF $sure_i$ is Sure2, THEN

            - Action: apply more tests from the
              first failed test on the path from the
              root OR node to the OR node.

            - label the OR node SOLVED,
              and CONTINUE.

          ELSE IF $sure_i$ is Sure3, THEN

            - $A_m = \Theta(1; F_1 = SS; G)$.

            - Invoke Sure(leaf OR node, $A_m$, $sure_i$).
          END

      ELSE

        step 7.1.7: $A_m \leftarrow$ multiple fault ambiguity group
                    of the leaf node.

        step 7.1.8: Invoke Sure(leaf OR node, $A_m$, $sure_i$).
      END

  END
6.3 Computational Issues


In order to make the algorithm efficient, we find all the failure sources with the same failure
signature in the test matrix. That is, we generate the set $N = \{N_1, N_2, \ldots, N_\beta\}$ such that $N_l \subseteq S$
for $l = 1, \ldots, \beta$ and all $s_i \in N_l$ have the same failure signature in the binary test matrix. Thus,
instead of invoking Sure strategies for the set $S$, we can invoke them for the set $N$. In this case, the
probability that none of the $s_i \in N_l$ is faulty, i.e., $\bar{p}(N_l)$, and the probability that only one of the
$s_i \in N_l$ is faulty, i.e., $p(N_l)$, can be evaluated as follows:

$$\bar{p}(N_l) = \prod_{s_i \in N_l} (1 - p(s_i)) \qquad (3)$$

$$p(N_l) = \sum_{s_i \in N_l} p(s_i) \prod_{s_j \in N_l,\, j \neq i} (1 - p(s_j))$$

Thus, using $\bar{p}(N_l)$ and $p(N_l)$, the conditional probabilities of minimal candidates can be evaluated.
For example, the conditional probabilities associated with the set $N$ at the starting point,
i.e., based on a single fault assumption, can be evaluated as follows:

$$p_0 = \frac{1}{1 + \sum_{k=1}^{\beta} \frac{p(N_k)}{\bar{p}(N_k)}}, \qquad p_{N_l} = \frac{\frac{p(N_l)}{\bar{p}(N_l)}}{1 + \sum_{k=1}^{\beta} \frac{p(N_k)}{\bar{p}(N_k)}} \quad \text{for } l = 1, \ldots, \beta \qquad (4)$$

In the case when $|N_l| = 1$ for $l = 1, \ldots, \beta$ and $\beta = m$, (4) reduces to (1).
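
As an illustration of the grouping step and of equations (3) and (4), the sketch below collects failure sources with identical failure signatures and evaluates the group-level single-fault conditional probabilities. It assumes that row $i$ of the test matrix holds the signature of failure source $i$ and that every prior is strictly between 0 and 1; the function names are hypothetical.

```python
from collections import defaultdict
from math import prod


def group_by_signature(test_matrix):
    """Collect indices of failure sources whose rows of B are identical."""
    groups = defaultdict(list)
    for i, row in enumerate(test_matrix):
        groups[tuple(row)].append(i)
    return list(groups.values())


def group_probabilities(group, prior):
    """Return (probability none is faulty, probability exactly one is faulty)."""
    none_faulty = prod(1 - prior[i] for i in group)
    exactly_one = sum(
        prior[i] * prod(1 - prior[j] for j in group if j != i)
        for i in group
    )
    return none_faulty, exactly_one


def single_fault_conditionals(groups, prior):
    """Equation (4): returns p_0 and the list of p_{N_l}."""
    ratios = []
    for group in groups:
        none_faulty, exactly_one = group_probabilities(group, prior)
        ratios.append(exactly_one / none_faulty)
    denom = 1 + sum(ratios)
    return 1 / denom, [r / denom for r in ratios]


# Sources 0 and 1 share a signature; source 2 has its own.
B = [[1, 0], [1, 0], [0, 1]]
prior = [0.1, 0.2, 0.5]
groups = group_by_signature(B)
p0, p_groups = single_fault_conditionals(groups, prior)
print(groups, p0, p_groups)
```

When every group has a single member, the ratio for each group is $p(s_i)/(1 - p(s_i))$, which is the special case noted after equation (4).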


7 Summary 


The computational and storage complexity of an optimal multiple fault strategy are super-exponential
in the number of failure sources, $m$. We presented several near-optimal algorithms
that provide a trade-off between optimality and computational complexity. First, we extended
the single-fault strategy of our previous work [7, 8, 11] to diagnose multiple faults by successively
isolating the potential single-fault candidates, then double-fault candidates, and so on. This is one
of the simplest multiple fault strategies that one can use. In this approach, the storage complexity
at each OR node of the AND/OR graph is the same as that in a single fault strategy. However,
using this approach, the probability of a false alarm error or RTOK is very high.

We then extended the single fault sequential testing strategies to a class of Sure strategies. The
basic idea of these strategies is to find one or more definitely failed components, while not making
an error when other co-existing faults are present. We explored three different approaches for the
application of additional tests, resulting in the Sure1-3 strategies.

Some of the advantages of using Sure strategies are: (1) the inherent combinatorial explosion 
that occurs in generating an optimal multiple fault strategy is reduced substantially, (2) the first 
iteration of the Sure strategies results in the same tree as in the single fault strategy, and therefore, 
these strategies isolate a single fault with the smallest average cost, while not making an error when 
multiple faults are present. Computational complexity of this approach is strictly related to the 
structure of the system, i.e., the structure of test matrix B. 

In order to overcome the problems associated with the size of the complete diagnostic strategy,
the test strategy should be generated "on-line". That is, instead of generating the entire diagnostic
tree, the interactive strategy only suggests the next test to be applied given the outcomes of 
previously applied tests. In addition, we assumed that the failure signature of each multiple failure 
is the union of the failure signatures of individual failures. However, this assumption does not hold 
for fault-tolerant (redundant) systems. In order to solve this problem, a binary test matrix based 
on minimal candidates, i.e., minimum number of failures with a failure signature different from the 
union of failure signatures of individual failures, should be generated. We expect to investigate 
these challenging issues in our future efforts [17, 18]. 
Appendix A: Conditional Probabilities of Failure Sources

Let us assume that hypothesis $S_l \subseteq S = \{s_1, \ldots, s_m\}$ denotes a set of failure sources such that
$\{s_i \in S_l\}$ are faulty and $\{s_j \notin S_l\}$ are fault-free. Thus, $S_l$ can be represented as an $m$-dimensional
hypothesis vector $x = [x_1, \ldots, x_m]$, where $x_i = 1$ if $s_i \in S_l$; otherwise, $x_i = 0$. From the failure
independence assumption, the probability of hypothesis vector $x$ is:

$$p(x) = \prod_{i=1}^{m} p(s_i)^{x_i} (1 - p(s_i))^{1 - x_i} = p(x = 0) \prod_{i=1}^{m} \left[\frac{p(s_i)}{1 - p(s_i)}\right]^{x_i} \qquad (5)$$

where $0$ is a zero vector of dimension $m$, and $p(x = 0)$ is the probability of the fault-free state of the
system. Using Bayes' rule, the conditional probability of failure hypothesis $x$ based on a single-fault
assumption, i.e., $P = \{p_0, p_1, \ldots, p_m\}$, is as follows:

$$p(x \mid SF) = \begin{cases} \dfrac{\frac{p(s_i)}{1 - p(s_i)}}{1 + \sum_{k=1}^{m} \frac{p(s_k)}{1 - p(s_k)}} & \text{if } x_i = 1 \text{ and } x_j = 0 \ \forall j \neq i \\[2ex] \dfrac{1}{1 + \sum_{k=1}^{m} \frac{p(s_k)}{1 - p(s_k)}} & \text{if } x = 0 \\[2ex] 0 & \text{otherwise} \end{cases} \qquad (6)$$

Thus, the conditional probability of failure source $s_i$ given the single fault assumption, $p_i$, is
the conditional probability of hypothesis vector $x = e_i$, where $e_i$ is the $i$-th unit vector, i.e., $x_i = 1$
and $x_j = 0 \ \forall j \neq i$, and $p_0 = p(x = 0 \mid SF)$.

Note that the a priori probability of failure source $s_i$, i.e., $p(s_i)$, can be derived from the distribution
function $F_i(t)$ as $p(s_i) = F_i(t_0)$, where $F_i(t)$ is the probability that failure source $s_i$ has failed at
or before time $t$, and $t_0$ is the UPTIME. In the following, we consider two special cases: (1)
Exponential distribution, and (2) Weibull distribution.
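A.1: Exponential Distribution

In this case, $F_i(t) = 1 - e^{-\lambda_i t}$ for $i = 1, \ldots, m$, where $\lambda_i = 1/\mathrm{MTTF}_i$ is the failure rate and
$\mathrm{MTTF}_i$ is the mean time to failure. Thus,

$$p(s_i) = 1 - e^{-\lambda_i t_0} \quad \text{for } i = 1, \ldots, m \qquad (7)$$

$$p(s_0) = \prod_{i=1}^{m} (1 - p(s_i)) = e^{-\lambda_T t_0}$$

where $\lambda_T = \sum_{i=1}^{m} \lambda_i$. The conditional probability of each failure source $s_i$, using equation (6), is as
follows:

$$p_i = \frac{e^{\lambda_i t_0} - 1}{1 + \sum_{j=1}^{m} (e^{\lambda_j t_0} - 1)} = \frac{e^{\lambda_i t_0} - 1}{1 - m + \sum_{j=1}^{m} e^{\lambda_j t_0}} \quad \text{for } i = 1, \ldots, m$$

$$p_0 = \frac{1}{1 - m + \sum_{j=1}^{m} e^{\lambda_j t_0}} \qquad (8)$$

If $\lambda_i t_0 \ll 1$ for $i = 1, \ldots, m$, then $e^{\lambda_i t_0} \approx 1 + \lambda_i t_0$. Thus, equation (8) reduces to:

$$p_i = \frac{\lambda_i t_0}{1 + \lambda_T t_0} = \frac{\lambda_i}{\frac{1}{t_0} + \lambda_T} \quad \text{for } i = 1, \ldots, m$$

$$p_0 = \frac{1}{1 + \lambda_T t_0} = \frac{\frac{1}{t_0}}{\frac{1}{t_0} + \lambda_T}$$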
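
A short numerical sketch can help check equations (6) to (8) and the small-rate approximation. It assumes MTTF values and the uptime $t_0$ are given in the same time unit; the function names and the sample numbers are illustrative only.

```python
import math


def exponential_priors(mttf, uptime):
    """Equation (7): p(s_i) = 1 - exp(-t0 / MTTF_i)."""
    return [1 - math.exp(-uptime / value) for value in mttf]


def single_fault_posteriors(priors):
    """Equation (6): returns p_0 and the list of p_i from a priori probabilities."""
    odds = [p / (1 - p) for p in priors]
    denom = 1 + sum(odds)
    return 1 / denom, [o / denom for o in odds]


def small_rate_approximation(mttf, uptime):
    """Approximation valid when lambda_i * t0 << 1 for every failure source."""
    rates = [1 / value for value in mttf]
    denom = 1 + sum(rates) * uptime
    return 1 / denom, [rate * uptime / denom for rate in rates]


mttf = [1000.0, 2000.0]   # hypothetical MTTF values, in hours
uptime = 1.0              # hypothetical uptime t0, in hours
p0, p = single_fault_posteriors(exponential_priors(mttf, uptime))
q0, q = small_rate_approximation(mttf, uptime)
print(p0, p)
print(q0, q)
```

For these values $\lambda_i t_0$ is at most $10^{-3}$, so the exact and approximate probabilities should agree closely.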


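A.2: Weibull Distribution

In this case, $F_i(t) = 1 - e^{-(\lambda_i t)^{\alpha}}$, where $1/\lambda_i$ is the characteristic life and $\alpha$ is a shape parameter
that changes the shape of the distribution compared with the exponential. Thus,

$$p(s_i) = 1 - e^{-(\lambda_i t_0)^{\alpha}} \quad \text{for } i = 1, \ldots, m \qquad (9)$$

$$p(s_0) = \prod_{i=1}^{m} (1 - p(s_i)) = e^{-(\lambda_w t_0)^{\alpha}}$$

where $\lambda_w = \left(\sum_{i=1}^{m} \lambda_i^{\alpha}\right)^{1/\alpha}$. Therefore, the conditional probability of each failure source $s_i$ is as
follows:

$$p_i = \frac{e^{(\lambda_i t_0)^{\alpha}} - 1}{1 + \sum_{j=1}^{m} (e^{(\lambda_j t_0)^{\alpha}} - 1)} = \frac{e^{(\lambda_i t_0)^{\alpha}} - 1}{1 - m + \sum_{j=1}^{m} e^{(\lambda_j t_0)^{\alpha}}} \quad \text{for } i = 1, \ldots, m$$

$$p_0 = \frac{1}{1 - m + \sum_{j=1}^{m} e^{(\lambda_j t_0)^{\alpha}}} \qquad (10)$$

If $\lambda_i t_0 \ll 1$ for $i = 1, \ldots, m$, then $e^{(\lambda_i t_0)^{\alpha}} \approx 1 + (\lambda_i t_0)^{\alpha}$. Thus, equation (10) reduces to:

$$p_i = \frac{(\lambda_i t_0)^{\alpha}}{1 + \sum_{j=1}^{m} (\lambda_j t_0)^{\alpha}} = \frac{\lambda_i^{\alpha}}{\frac{1}{t_0^{\alpha}} + \lambda_w^{\alpha}} = \frac{\lambda_i^{\alpha}}{\lambda_w^{\alpha}} (1 - p_0) \quad \text{for } i = 1, \ldots, m$$

$$p_0 = \frac{1}{1 + (\lambda_w t_0)^{\alpha}} = \frac{\frac{1}{t_0^{\alpha}}}{\frac{1}{t_0^{\alpha}} + \lambda_w^{\alpha}} \qquad (11)$$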
Abstract - In this paper, we consider the problem of
constructing optimal and near-optimal multiple fault
diagnosis (MFD) in bipartite systems with unreliable
(imperfect) tests. It is known that exact computation
of conditional probabilities for multiple fault diagnosis
is NP-hard. The novel feature of our diagnostic algorithms
is the use of Lagrangian relaxation and subgradient
optimization methods to provide: (1) near-optimal
solutions for the MFD problem, and (2) upper
bounds for an optimal branch-and-bound algorithm.
The proposed method is illustrated using several examples.
Computational results indicate that: (1) our
algorithm has superior computational performance to
the existing algorithms (approximately three orders of
magnitude improvement over the algorithm in [3]), (2)
the near-optimal algorithm generates the most likely
candidates with a very high accuracy, and (3) our algorithm
can find the most likely candidates in systems
with as many as 1000 faults.

I. Introduction 

With the increased recognition of the importance of design
for testability, there is an increasing trend towards
the use of smart sensors for on-board system health management.
The results of on-board tests are available to
the ground test systems and operators as a block of symptoms.
Due to improper setup, operator error, electromagnetic
interference, environmental conditions, or aliasing
inherent in the signature analysis of on-board tests,
the tests may be unreliable (imperfect). Imperfect
tests introduce an additional element of uncertainty
into the diagnostic process: the pass outcome of a test
does not guarantee the integrity of the components under
test (because the test may have missed a fault), and a
failed test outcome does not mean that one or more of
the implicated components are faulty (because the test
outcome may have been a false alarm). Consequently,
the diagnostic procedures must hedge against this uncertainty
in test outcomes.

* Research supported by NASA-Ames Research Center and the
Department of Economic Development of the State of Connecticut.

In this paper, we consider the problem of construct- 
ing optimal and near-optimal multiple fault diagnosis in 
bipartite digraphs with unreliable tests. This problem 
is a central and long-standing concern in system fault 
diagnosis, and medical decision making [10]. When the 
false alarm probabilities of all tests are zero, the problem 
simplifies to the parsimonious covering theory (or proba- 
bilistic causal model) discussed in [16]. Peng and Reggia 
[15] proposed a competition based connectionist method 
to subdue the problem of combinatorial explosion in com- 
puting the posterior probabilities of all possible combi- 
nations of failure sources in probabilistic causal models. 
However, this method does not guarantee a global opti- 
mum and suffers from large computation times even for 
problems with small numbers of failure sources, m=26. 

Genetic algorithms have been offered as an alternative to the
connectionist methods [3, 9]. Genetic algorithms are
based on an analogy with Darwin's theory of biological
evolution, in which a group of solutions evolves via natural
selection. They emulate the mechanisms of the biological
evolutionary process, such as reproduction, crossover, mutation,
and natural selection. At each iteration, a population
of individuals is established, where each individual
corresponds to a point in the search space. The objective
function is evaluated for each individual to rate its fitness.
Then, a next generation is formed based on the survival
of the fittest. Therefore, the evolution of individuals
from generation to generation tends to result in fitter
individuals (i.e., solutions) in the search space. These
algorithms converge extremely slowly, and have been applied
to small problems with m=20 failure sources (causes, disorders)
and n=20 tests (manifestations, symptoms).

Wu [20] proposed a decomposition method based on 
common and disjoint causal (failure source) relationships 
among the given symptoms (tests). This method decom- 
poses the original problem into smaller and independent 
subproblems, and therefore, increases the performance 
and efficiency of multiple fault diagnosis. However, this 
approach is not applicable for systems with large num- 
bers of nondecomposable causes and symptoms. 

In this paper, we present a novel approach, using 
Lagrangian relaxation, to solve multiple fault diagnosis 
problem. By defining new variables and constraints, the 
multiple fault diagnosis (MFD) problem reduces to a 
combinatorial optimization problem with a set of equal- 
ity constraints. The constraints are relaxed via Lagrange 
multipliers. The relaxation procedure generates an upper 
bound for the objective function. The procedure of min- 
imizing the upper bound via a subgradient optimization 
produces a sequence of solutions that are modified, in a 
computationally effective way, to produce a sequence of 
feasible solutions to the MFD problem. If the objective 
function value for the best feasible solution and the upper 
bound are the same, the feasible solution is the optimal 
solution. Otherwise, the difference between the upper 
bound and the feasible solution, termed the approximate 
duality gap, provides a measure of suboptimality of the 
MFD solution. Alternatively, the optimal solution can
be found via a tree search (or branch-and-bound) procedure.
The computational complexity of the near-optimal
algorithm is a linear function of the number of failure
sources, $m$, and the number of failed tests, $|T_f|$.

Next, we present an approach to determine a ranked
set of multiple fault diagnosis solutions (i.e., the best,
second best, ..., L-th best diagnosis). In this approach,
following Murty [12] and Cox et al. [5], we: (1) partition
the MFD problem, based on its best solution, into
disjoint subproblems; (2) solve the subproblems and sort
them by the values of their solutions; and (3) select the
subsequent best solutions. One of the advantages of this
approach, compared to the one in [14], is that since the
subproblems are disjoint, the optimal solution of each
subproblem is different from the others. Finally, we show
that the MFD algorithm can be extended to solve multiple
fault diagnosis problems with repetitive application
of tests.

The paper is organized as follows. In Section II, we 
formulate the multiple fault diagnosis problem in a bi- 
partite system. In Section III, we present a near-optimal 
algorithm based on Lagrangian relaxation and subgra- 
dient optimization method to diagnose multiple faults, 
and generate an upper bound for the likelihood of mul- 
tiple fault candidates. The upper bound can be used in 
an optimal branch-and-bound algorithm. The multiple 
fault algorithms for a set of L-ranked multiple fault di- 
agnoses are presented in Section IV. In Section V, we 
consider the multiple fault diagnosis problem with repet- 
itive tests. Several examples are presented in Section VI. 
Finally, in Section VII, we summarize the results and 
discuss future research issues. 

II. Problem Formulation 

The MFD problem in bipartite systems with imperfect
tests consists of a bipartite digraph $DG = (S, T, E)$,
where

• $S = \{s_1, \ldots, s_m\}$ is a finite set of independent failure
sources (failure nodes) associated with the system;

• $T = \{t_1, t_2, \ldots, t_n\}$ is a finite set of $n$ available binary
outcome tests (test nodes), where the integrity
of system failure sources/components/modules can
be ascertained;

• $E = \{e_{ij}\}$ is the set of digraph edges (links) specifying
the functional information flow between the set
of failure sources and the set of tests in the system.

The input requirements of the failure nodes and edges
of the digraph are as follows:

1. Failure node: A priori probability vector of failure
nodes $P = [p(s_1), \ldots, p(s_m)]$, where $p(s_i) > 0$ is the
a priori probability of failure source $s_i$.

2. Link (edge): A set of probability pairs $P_{ij} = (Pd_{ij}, Pf_{ij})$
representing the detection and false-alarm
probabilities of the set of tests, where $Pd_{ij}$ and $Pf_{ij}$
are the detection and false alarm probabilities of test
$t_j$ and failure source $s_i$, respectively (see Figure 1).
Figure 2 shows a bipartite digraph model.

The problem is to find the most likely candidates $X \subseteq S$
that are consistent with the results of applied tests.
This is formulated as:

$$\max_{X \subseteq S} \mathrm{Prob}(X \mid T_p, T_f) \qquad (1)$$
Figure 1: Detection-False-Alarm Probability of Failure
Source $s_i$ and Test $t_j$

Figure 2: Illustration of the Bipartite Digraph Model

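where $T_p \subseteq T$ and $T_f \subseteq T$ denote the sets of passed and
failed tests, respectively. Using Bayes' rule and eliminating
the constant factor $\mathrm{Prob}(T_p, T_f)$, we obtain the
following equivalent maximization problem:

$$\max_{X \subseteq S} \mathrm{Prob}(T_p, T_f \mid X)\, \mathrm{Prob}(X) \qquad (2)$$

For notational simplicity, we define a binary vector $x$
of size $m$, where $x_i = 1$ if failure source $s_i \in X$, and $x_i = 0$
otherwise. Note that, given a multiple fault candidate $X$,
the tests are independent. Thus, the above probabilities
can be evaluated as follows:

$$\mathrm{Prob}(T_p \mid X) = \prod_{t_k \in T_p} \mathrm{Prob}(O(t_k) = p \mid X) \qquad (3)$$

$$\mathrm{Prob}(T_f \mid X) = \prod_{t_j \in T_f} \mathrm{Prob}(O(t_j) = f \mid X) \qquad (4)$$

$$\mathrm{Prob}(O(t_j) = p \mid X) = \prod_{i=1}^{m} (1 - Pd_{ij})^{x_i} (1 - Pf_{ij})^{1 - x_i} = \left[\prod_{i=1}^{m} (1 - Pf_{ij})\right] \left[\prod_{i=1}^{m} \left(\frac{1 - Pd_{ij}}{1 - Pf_{ij}}\right)^{x_i}\right] \qquad (5)$$

$$\mathrm{Prob}(O(t_j) = f \mid X) = 1 - \mathrm{Prob}(O(t_j) = p \mid X) \qquad (6)$$

$$\mathrm{Prob}(X) = \prod_{i=1}^{m} p(s_i)^{x_i} (1 - p(s_i))^{1 - x_i} = \left[\prod_{i=1}^{m} (1 - p(s_i))\right] \left[\prod_{i=1}^{m} \left(\frac{p(s_i)}{1 - p(s_i)}\right)^{x_i}\right] \qquad (7)$$

where $O(t_j) \in \{p\ (= \text{pass}), f\ (= \text{fail})\}$ is the outcome of
test $t_j$, and $Pd_{ij} = 0$ and $Pf_{ij} = 0$ for $e_{ij} \notin E$.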
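
The likelihood in (2) to (7) can be evaluated directly for a small system, which is useful as a reference when checking a faster method. The sketch below scores each candidate by the logarithm of $\mathrm{Prob}(T_p, T_f \mid X)\,\mathrm{Prob}(X)$ and picks the best one by exhaustive enumeration. This is an assumed brute-force baseline with hypothetical function names and sample data, not the Lagrangian relaxation algorithm of this paper; priors are assumed to lie strictly between 0 and 1.

```python
import math
from itertools import product


def _ln(value):
    """Natural log that maps zero probability to minus infinity."""
    return math.log(value) if value > 0 else float("-inf")


def prob_pass(x, j, pd, pf):
    """Equation (5): probability that test j passes given candidate x."""
    return math.prod(
        (1 - pd[i][j]) if x[i] else (1 - pf[i][j]) for i in range(len(x))
    )


def log_score(x, prior, pd, pf, passed, failed):
    """ln of Prob(T_p, T_f | X) Prob(X), built from equations (3)-(7)."""
    score = sum(_ln(prior[i] if x[i] else 1 - prior[i]) for i in range(len(x)))
    score += sum(_ln(prob_pass(x, k, pd, pf)) for k in passed)
    score += sum(_ln(1 - prob_pass(x, j, pd, pf)) for j in failed)
    return score


def most_likely_candidate(prior, pd, pf, passed, failed):
    """Exhaustive search over all 2^m candidates; only viable for small m."""
    candidates = product((0, 1), repeat=len(prior))
    return max(candidates, key=lambda x: log_score(x, prior, pd, pf, passed, failed))


# Two failure sources, two tests; pd[i][j] and pf[i][j] are zero off the edges.
prior = [0.1, 0.1]
pd = [[0.9, 0.0], [0.0, 0.9]]
pf = [[0.05, 0.0], [0.0, 0.05]]
best = most_likely_candidate(prior, pd, pf, passed=[1], failed=[0])
print(best)
```

With test 0 failed and test 1 passed, the candidate containing only the first failure source has the highest score in this example, ahead of the fault-free explanation in which the failed test is a false alarm.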

III. Problem Solution 

One approach for generating the optimal multiple fault
diagnosis is to consider all possible combinations of failure
sources, i.e., the power set $2^S$, and select the multiple
fault candidate with the highest likelihood in (1).
However, the computational complexity of
this approach is exponential in the number of failure
sources $m$. In the following, we present an algorithm,
based on Lagrangian relaxation and subgradient optimization,
to generate a near-optimal solution for this problem.
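The exhaustive approach can be sketched in a few lines of Python for very small
models. This is only an illustration of why enumeration is exponential in $m$;
the function names and data layout are assumptions of this example, and the
scoring function works in the log domain to avoid underflow.

```python
import itertools
import math

def log_score(x, passed, failed, Pd, Pf, prior):
    """ln[Prob(T_f, T_p | X) Prob(X)] for a 0/1 tuple x, following (2)-(7)."""
    total = sum(math.log(prior[i] if xi else 1.0 - prior[i]) for i, xi in enumerate(x))
    for j in list(passed) + list(failed):
        p_pass = math.prod(1.0 - (Pd[i][j] if xi else Pf[i][j]) for i, xi in enumerate(x))
        p = p_pass if j in passed else 1.0 - p_pass
        if p <= 0.0:
            return -math.inf      # candidate is inconsistent with this outcome
        total += math.log(p)
    return total

def exhaustive_mfd(passed, failed, Pd, Pf, prior):
    """Scan all 2^m candidates; only practical for a handful of failure sources."""
    m = len(prior)
    return max(itertools.product((0, 1), repeat=m),
               key=lambda x: log_score(x, passed, failed, Pd, Pf, prior))

# Hypothetical two-source, two-test model (illustrative numbers only).
Pd = [[0.9, 0.0], [0.0, 0.8]]
Pf = [[0.0, 0.0], [0.0, 0.0]]
best = exhaustive_mfd(passed=[1], failed=[0], Pd=Pd, Pf=Pf, prior=[0.1, 0.2])
print(best)
```

The loop visits $2^m$ tuples, which is what motivates the relaxation-based
method developed in the rest of this section.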

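By substituting (3) and (4) into (2) and taking the
natural logarithm of the resulting objective function, the
problem is equivalent to:

\[ \max_{x} \sum_{t_j \in T_f} \ln(\mathrm{Prob}(O(t_j) = f \mid X))
   + \sum_{t_k \in T_p} \ln(\mathrm{Prob}(O(t_k) = p \mid X))
   + \ln(\mathrm{Prob}(X)) \tag{8} \]

By substituting (5), (6) and (7) into (8), the problem
reduces to:

\[ \max_{x} \sum_{t_j \in T_f} \ln\Big( 1 -
     \Big[ \prod_{i=1}^{m} \Big( \frac{\overline{Pd}_{ij}}{\overline{Pf}_{ij}} \Big)^{x_i} \Big]
     \Big[ \prod_{i=1}^{m} \overline{Pf}_{ij} \Big] \Big)
   + \sum_{t_k \in T_p} \Big\{ \sum_{i=1}^{m} x_i \ln\Big( \frac{\overline{Pd}_{ik}}{\overline{Pf}_{ik}} \Big)
     + \sum_{i=1}^{m} \ln(\overline{Pf}_{ik}) \Big\}
   + \sum_{i=1}^{m} \{ x_i \ln(p_i) + \ln(1 - p(s_i)) \} \tag{9} \]

where $p_i = p(s_i) / (1 - p(s_i))$, $\overline{Pf}_{ij} = 1 - Pf_{ij}$ and
$\overline{Pd}_{ij} = 1 - Pd_{ij}$ for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. By
(i) eliminating the constant terms $\sum_{i=1}^{m} \ln(1 - p(s_i))$ and
$\sum_{t_k \in T_p} \sum_{i=1}^{m} \ln(\overline{Pf}_{ik})$, and (ii) defining new variables
$y_j = \big[ \prod_{i=1}^{m} (\overline{Pd}_{ij} / \overline{Pf}_{ij})^{x_i} \big]
\big[ \prod_{i=1}^{m} \overline{Pf}_{ij} \big]$ for $t_j \in T_f$, and taking
the natural logarithm of it, the problem reduces to
the following optimization problem:

\[ \max_{x, y} J(x, y) = \sum_{t_j \in T_f} \ln(1 - y_j)
   + \sum_{i=1}^{m} x_i \Big\{ \sum_{t_k \in T_p} \ln\Big( \frac{\overline{Pd}_{ik}}{\overline{Pf}_{ik}} \Big)
     + \ln(p_i) \Big\} \tag{10} \]

subject to:

\[ \ln(y_j) = \sum_{i=1}^{m} x_i \ln\Big( \frac{\overline{Pd}_{ij}}{\overline{Pf}_{ij}} \Big)
   + \sum_{i=1}^{m} \ln(\overline{Pf}_{ij}) \tag{11} \]

\[ 0 \le y_j \le 1 \quad \text{for } t_j \in T_f \tag{12} \]

\[ x_i = 0 \text{ or } 1 \quad \text{for } i = 1, \ldots, m \tag{13} \]

where $y = [y_1, y_2, \ldots, y_{|T_f|}]$, and $|\cdot|$ denotes set
cardinality. For simplicity, we define new variables
$h_j = \sum_{i=1}^{m} \ln(\overline{Pf}_{ij})$ for $t_j \in T_f$. Note that, if we define
$Pc_{ij} = \overline{Pd}_{ij} / \overline{Pf}_{ij}$ for $i = 1, \ldots, m$ and
$j = 1, \ldots, n$, then $Pc_{ij}$, $h_j$ and $p_i$ are sufficient statistics
for solving this problem.
The following lemmas present two important properties
of the MFD problem.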

Lemma 1: If $Pf_{ik} = 0$ and $Pd_{ik} = 1$ for any passed test
$t_k$, then the optimal solution does not contain failure
source $s_i$, i.e., $x_i = 0$ (or equivalently $s_i \notin X$).

Proof: If $s_i \in X$ (or $x_i = 1$), then the term
$\ln(\overline{Pd}_{ik} / \overline{Pf}_{ik}) = \ln(0)$ appears in the second part
of the objective function in (10), and, consequently, the
overall objective function will be unbounded, i.e., it
would be $-\infty$.

Using Lemma 1, the size of the MFD problem can
be reduced by removing all failure sources $\{ s_i \mid Pf_{ik} = 0,
Pd_{ik} = 1 \text{ and } t_k \in T_p \}$ from the problem.
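The reduction implied by Lemma 1 is simple to express in code. The sketch below
is a hypothetical helper (its name and the nested-list layout of `Pd` and `Pf`
are assumptions of this example) that returns the indices of the failure
sources that remain after the lemma is applied.

```python
def lemma1_reduce(passed, Pd, Pf):
    """Indices of failure sources not ruled out by a perfect passed test."""
    m = len(Pd)
    return [i for i in range(m)
            # s_i is removed if some passed test t_k has Pf_ik = 0 and Pd_ik = 1
            if not any(Pf[i][k] == 0 and Pd[i][k] == 1 for k in passed)]

# Hypothetical model: test 0 detects source 0 perfectly, test 1 detects source 1 perfectly.
Pd = [[1.0, 0.0], [0.7, 1.0]]
Pf = [[0.0, 0.0], [0.0, 0.0]]
print(lemma1_reduce(passed=[0], Pd=Pd, Pf=Pf))
```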

Lemma 2: If the false alarm probabilities of a failed
test $t_j$ are zero, i.e., $Pf_{ij} = 0$ for $i = 1, \ldots, m$, then the
optimal solution contains at least one $x_i = 1$ such that
$Pd_{ij} > 0$. That is, the optimal solution must cover the
failed tests.

Proof: We prove this lemma by contradiction. $Pf_{ij} = 0$
for $i = 1, \ldots, m$ results in $h_j = 0$. If $x_i = 0$ for all $i$
with $Pd_{ij} > 0$, then we have $\ln(y_j) = 0$ and, hence, $y_j = 1$. Thus,
$\ln(1 - y_j)$, the first part of the objective function in (10),
and, consequently, the overall objective function will be
unbounded.

Using Lemma 2, we define the following constraints:

\[ A x \ge e \quad \text{for } t_j \in T_f \text{ and } h_j = 0 \tag{14} \]

where $A = \{a_{ji}\}$ is a binary matrix of size $|H| \times m$;
$H = \{ t_j \in T_f \mid h_j = 0 \text{ for } j = 1, \ldots, n \}$; each row of
matrix $A$ corresponds to a failed test $t_j$ with $h_j = 0$;
$a_{ji} = 1$ if $Pd_{ij} > 0$ for $i = 1, \ldots, m$, otherwise $a_{ji} = 0$; and
$e$ is a vector of 1's.

Adding the set of constraints (14) to the problem in
(10)-(13) results in a smaller search space and tighter
upper and lower bounds (best feasible solution found),
and, therefore, a better estimate of the optimal solution.
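A minimal sketch of how the covering matrix $A$ in (14) could be assembled is
given below. The helper names are hypothetical, and the nested-list layout of
`Pd` and `Pf` is an assumption of this example.

```python
def covering_constraints(failed, Pd, Pf):
    """Rows of A in (14): one row per failed test whose false-alarm terms are all zero."""
    m = len(Pd)
    rows = []
    for j in failed:
        if all(Pf[i][j] == 0 for i in range(m)):        # equivalent to h_j = 0
            rows.append([1 if Pd[i][j] > 0 else 0 for i in range(m)])
    return rows

def satisfies_cover(rows, x):
    """True when A x >= e, i.e. every listed failed test is covered by some x_i = 1."""
    return all(sum(a * xi for a, xi in zip(row, x)) >= 1 for row in rows)

# Hypothetical three-source, two-test model.
Pd = [[0.9, 0.0], [0.0, 0.8], [0.5, 0.5]]
Pf = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
A = covering_constraints(failed=[0], Pd=Pd, Pf=Pf)
print(A, satisfies_cover(A, (0, 0, 1)))
```

Candidates that fail `satisfies_cover` have an objective value of $-\infty$ by
Lemma 2, so they can be discarded before any likelihood is evaluated.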

Lemma 3: When all tests are perfect, that is, $Pd_{ij} = 1$
and $Pf_{ij} = 0$ for $i = 1, \ldots, m$, $j = 1, \ldots, n$ and $e_{ij} \in E$,
using Lemmas 1 and 2, the problem reduces to the following
set-covering problem: $\max_x \sum_{s_i \in S^-} \ln(p_i)\, x_i$ subject
to (13) and (14), where $S^-$ is the reduced set of failure
sources, i.e., $S$ after eliminating the failure sources
satisfying Lemma 1.

Proof: This lemma can easily be proved by Lemmas 1
and 2. $Pd_{ij} = 1$ and $Pf_{ij} = 0$ for the failed tests results
in $h_j = y_j = 0$. Therefore, the first part of the objective
function in (10) vanishes, the test-dependent terms of the
second part are eliminated (using Lemma 1), and the problem
reduces to the traditional set-covering problem. The
set-covering problem can be solved optimally by any optimal
set-covering algorithm [2, 7], or near-optimally via a
Lagrangian relaxation and subgradient optimization method [1].

By relaxing the constraints in (11) via Lagrange multipliers
$\{\lambda_j\}$, we obtain the Lagrangian function:

\[ \max_{x, y} Q(\lambda, x, y)
   = \sum_{t_j \in T_f} \{ \ln(1 - y_j) + \lambda_j \ln(y_j) \}
   + \sum_{i=1}^{m} x_i \Big\{ \sum_{t_k \in T_p} \ln\Big( \frac{\overline{Pd}_{ik}}{\overline{Pf}_{ik}} \Big)
     + \ln(p_i) - \sum_{t_j \in T_f} \lambda_j \ln\Big( \frac{\overline{Pd}_{ij}}{\overline{Pf}_{ij}} \Big) \Big\}
   - \sum_{t_j \in T_f} \lambda_j h_j \]
\[ \qquad = \sum_{t_j \in T_f} f_j(\lambda_j, y_j) + \sum_{i=1}^{m} c_i(\lambda)\, x_i
   - \sum_{t_j \in T_f} \lambda_j h_j \tag{15} \]

subject to (12), (13) and (14), where $f_j(\lambda_j, y_j)$ and $c_i(\lambda)$
denote the first and second expressions in braces in
(15), respectively. The important point here is that the
maximization of the Lagrangian function in (15) with respect
to $x$ and $y$ can be carried out independently for each fixed
$\lambda$. Maximization of $Q(\lambda, x, y)$ with respect to $y$ is
equivalent to:

\[ \max_{y_j} f_j(\lambda_j, y_j) = \ln(1 - y_j) + \lambda_j \ln(y_j) \quad \text{for } t_j \in T_f \tag{16} \]

The maximum of this function is attained at
$y_j^*(\lambda_j) = \frac{\lambda_j}{1 + \lambda_j}\, u(\lambda_j)$, where $u(\cdot)$ is
the unit step function. At $y_j^*(\lambda_j)$, the first and second
derivatives of the function are zero and negative, respectively,
indicating that $f_j(\lambda_j, y_j)$ is a maximum.
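The stationarity and concavity conditions behind this closed form can be
written out explicitly. The short derivation below uses only the definition of
$f_j$ in (16) and assumes $\lambda_j > 0$ and $0 < y_j < 1$.

```latex
\frac{\partial f_j}{\partial y_j}
  = -\frac{1}{1 - y_j} + \frac{\lambda_j}{y_j} = 0
  \;\Longrightarrow\;
  y_j^*(\lambda_j) = \frac{\lambda_j}{1 + \lambda_j},
\qquad
\frac{\partial^2 f_j}{\partial y_j^2}
  = -\frac{1}{(1 - y_j)^2} - \frac{\lambda_j}{y_j^2} < 0 .
```

For example, $\lambda_j = 1$ gives $y_j^* = 1/2$.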

The maximization of the Lagrangian function
$Q(\lambda, x, y)$ with respect to $x$ is equivalent to:

\[ \max_{x} W(\lambda, x) = c^T(\lambda)\, x = \sum_{i=1}^{m} c_i(\lambda)\, x_i \tag{17} \]

subject to (13) and (14), which is a traditional
set-covering problem. This problem has been extensively
studied by the operations research and management
science communities [2, 7]. There exist a number of optimal
algorithms for this problem, based on feasible solution
exclusion constraints, Gomory f-cuts and tree-search
procedures [2, 7]. Let $x^*(\lambda)$ be the optimal solution of
this set-covering problem. Then $Q(\lambda, x^*(\lambda), y^*(\lambda))$ is
an upper bound for the optimal objective value in (10).
This result is summarized in the following lemma:
Lemma 4: Let $J^*$ be the optimal value of the objective
function in (10). Then $Q(\lambda, x^*(\lambda), y^*(\lambda)) \ge J^*$ for any
$\lambda$.

Proof: Let $x^o$ and $y^o$ be the optimal solution of the
problem in (10). Then $Q(\lambda, x^o, y^o) \le Q(\lambda, x^*(\lambda), y^*(\lambda))$,
because $x^*(\lambda)$ and $y^*(\lambda)$ are optimal with respect
to the relaxed problem in (15). Since the optimal solutions
$x^o$ and $y^o$ satisfy (11), we have $Q(\lambda, x^o, y^o) = J^*$,
and therefore, $J^* \le Q(\lambda, x^*(\lambda), y^*(\lambda))$.

After evaluating the optimum values $x^*(\lambda)$ and $y^*(\lambda)$
for a fixed $\lambda$, the problem reduces to one of minimizing
the upper bound $Q(\lambda) = Q(\lambda, x^*(\lambda), y^*(\lambda))$. Since $Q(\lambda)$ is
a piecewise-defined (nondifferentiable) function of $\lambda$, this
problem cannot be solved using differentiable optimization
algorithms. As an alternative, we use a subgradient
optimization algorithm [13] to produce a sequence of upper
bounds $Q(\lambda)$.

If we denote by $Q^*$ the optimal Lagrangian function
value, i.e., $Q^* = Q(\lambda^*) = \min_{\lambda} Q(\lambda)$, the difference
$(Q^* - J^*)$ is termed the exact duality gap. Since the
problem in (10)-(14) is NP-hard [4], we may never know
the global optimal value $J^*$. Instead, we construct
several feasible solutions to this problem from the
Lagrangian function solution, and select the best feasible
solution from the set. Let $J(\lambda^*, x^f, y^f)$ be the best
feasible value; then we have $J(\lambda^*, x^f, y^f) \le J^* \le Q^*$. A
nice feature of the Lagrangian relaxation method is that
the approximate duality gap:

\[ Q^* - J(\lambda^*, x^f, y^f) = (Q^* - J^*) + (J^* - J(\lambda^*, x^f, y^f)) \ge 0 \tag{18} \]

provides an overestimate (by the value of the exact
duality gap, $(Q^* - J^*)$) of the error between the global
optimal solution and the best feasible solution found.
Thus, in some cases, even though the best feasible
solution found is the optimum solution of the problem, the
approximate duality gap may be nonzero; see Example 1
in Section VI. Based on extensive computational
experiments, the relative approximate duality gap, $\delta J$, defined
by:

\[ \delta J = \frac{Q^* - J(\lambda^*, x^f, y^f)}{J(\lambda^*, x^f, y^f)} \tag{19} \]

is small for the multiple fault diagnosis algorithms
(typically less than 5%). The pseudocode of the multiple fault
diagnosis algorithm is presented in the next section.


A. Multiple Fault Diagnosis Algorithm 

Let $(x^f, y^f)$, $Q_{min}$, $Q_{ub}$ and $Q_{lb}$ be the best feasible
solution found, minimum upper bound, current upper
bound and maximum lower bound (function value based
on the best feasible solution found, i.e., $J(x^f, y^f)$) for
$Q(\lambda, x, y)$, respectively. The pseudocode of the multiple fault
diagnosis algorithm is shown in Figure 3.

Initialization: Initialize: (1) $\lambda_j = 1$ for $j = 1, \ldots, |T_f|$,
(2) $Q_{min} = \infty$, (3) $Q_{lb} = -\infty$, and (4) set iteration count
$t = 1$. The reason for initializing $\lambda_j = 1$ is that it results
in $y_j^* = 0.5$.

Step 1: Find optimum values $x^*(\lambda)$ by solving the
set-covering problem in (17).

Step 2: Find optimum values $y^*(\lambda)$,
where $y_j^*(\lambda_j) = \frac{\lambda_j}{1 + \lambda_j}\, u(\lambda_j)$ for $j = 1, \ldots, |T_f|$.

Step 3: Evaluate $y(x^*(\lambda))$ using equation (11).

Step 4: Update $x^f$, $y^f$, $Q_{min}$, $Q_{ub}$ and $Q_{lb}$ as follows:

• If $J(x^*(\lambda), y(x^*(\lambda))) > Q_{lb}$, then
$x^f = x^*(\lambda)$, $y^f = y(x^*(\lambda))$,
and $Q_{lb} = J(x^*(\lambda), y(x^*(\lambda)))$,

• $Q_{ub} = Q(\lambda, x^*(\lambda), y^*(\lambda))$,

• $Q_{min} = \min(Q_{min}, Q_{ub})$.

Step 5: Calculate the subgradient
$d_j = \ln(y_j^*) - \{ \sum_{i=1}^{m} x_i^*(\lambda) \ln(Pc_{ij}) + h_j \}$
for $j = 1, \ldots, |T_f|$.

Step 6: Stop if $\sum_{j=1}^{|T_f|} d_j^2 = 0$, since in this case we
cannot define a suitable step size.

Step 7: Define a step size /? by 0 — — WTr 1 ^ 

( 2^=i d V 

where initially $f = 2$. If $Q_{min}$ has not
decreased in the last 10 iterations of the
subgradient procedure with the current value
of $f$, then $f$ is halved. This approach to
deciding the value of $f$ is based on the
procedure of Fisher [6]. The parameter $\alpha$, with
typical value $1 < \alpha < 1.1$, ensures that $\beta$
does not become too small as the gap between
$Q_{ub}$ and $Q_{lb}$ decreases [1].

Step 8: Stop if $f < 0.05$ or $t > 100$ (or any other
suitable stopping criterion).

Step 9: Update the Lagrange multipliers $\lambda_j$ as follows:
$\lambda_j = \max(0, \lambda_j + \beta d_j)$ for $t_j \in T_f$, $t \leftarrow t + 1$,
and go to Step 1.

Figure 3: Pseudocode of MFD Algorithm 
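The loop in Figure 3 can be sketched in Python for small models as follows.
Several parts of this sketch are assumptions rather than the authors'
implementation: the set-covering subproblem of Step 1 is solved by brute-force
enumeration instead of a dedicated solver, the step-size rule of Step 7 is
replaced by a generic gap-scaled rule (the gap between the bounds divided by the
squared subgradient norm, without the parameter $\alpha$), the multipliers are
kept above a small positive floor instead of zero so that $\ln(y_j^*)$ stays
finite, and the update moves against the subgradient with a positive step. All
function and variable names are hypothetical.

```python
import itertools
import math

def mfd_subgradient(passed, failed, Pd, Pf, prior, max_iter=100):
    """Sketch of the Figure 3 loop; returns (x_f, Q_lb, Q_min)."""
    m, eps = len(prior), 1e-9
    # Sufficient statistics ln(Pc_ij), ln(p_i), h_j (assumes all Pd_ij, Pf_ij < 1).
    ln_pc = [[math.log((1 - Pd[i][j]) / (1 - Pf[i][j])) for j in range(len(Pd[i]))]
             for i in range(m)]
    ln_p = [math.log(p / (1 - p)) for p in prior]
    h = {j: sum(math.log(1 - Pf[i][j]) for i in range(m)) for j in failed}
    base = [sum(ln_pc[i][k] for k in passed) + ln_p[i] for i in range(m)]
    # Constraints (14): each failed test with h_j = 0 must be covered.
    cover = [[i for i in range(m) if Pd[i][j] > 0] for j in failed if h[j] == 0]
    feasible = [x for x in itertools.product((0, 1), repeat=m)
                if all(any(x[i] for i in row) for row in cover)]

    def ln_y(x, j):                      # right-hand side of (11)
        return sum(ln_pc[i][j] for i in range(m) if x[i]) + h[j]

    def objective(x):                    # J(x, y(x)) from (10)
        total = sum(base[i] for i in range(m) if x[i])
        for j in failed:
            y = math.exp(ln_y(x, j))
            if y >= 1.0:
                return -math.inf
            total += math.log(1.0 - y)
        return total

    lam = {j: 1.0 for j in failed}       # initial multipliers give y*_j = 0.5
    q_min, q_lb, x_f, f, stall = math.inf, -math.inf, None, 2.0, 0
    for _ in range(max_iter):
        c = [base[i] - sum(lam[j] * ln_pc[i][j] for j in failed) for i in range(m)]
        # Step 1: brute-force stand-in for a set-covering solver (assumption).
        x = max(feasible, key=lambda z: sum(ci for ci, zi in zip(c, z) if zi))
        # Step 2: closed-form maximizer of f_j.
        y = {j: lam[j] / (1.0 + lam[j]) for j in failed}
        # Steps 3-4: update the lower and upper bounds.
        value = objective(x)
        if value > q_lb:
            q_lb, x_f = value, x
        q_ub = (sum(math.log(1 - y[j]) + lam[j] * math.log(y[j]) for j in failed)
                + sum(ci for ci, xi in zip(c, x) if xi)
                - sum(lam[j] * h[j] for j in failed))
        if q_ub < q_min - 1e-12:
            q_min, stall = q_ub, 0
        else:
            stall += 1
            if stall >= 10:              # halve f after 10 non-improving iterations
                f, stall = f / 2.0, 0
        # Step 5: subgradient of Q with respect to each multiplier.
        d = {j: math.log(y[j]) - ln_y(x, j) for j in failed}
        norm = sum(v * v for v in d.values())
        # Steps 6 and 8: stopping rules.
        if norm == 0.0 or f < 0.05:
            break
        # Step 7: ASSUMED gap-scaled step size (not the rule of the original text).
        gap = q_ub - q_lb if math.isfinite(q_lb) else 1.0
        step = f * max(gap, eps) / norm
        # Step 9: move against the subgradient; floor at eps (assumption).
        lam = {j: max(eps, lam[j] - step * d[j]) for j in failed}
    return x_f, q_lb, q_min

# Hypothetical three-source, three-test model with zero false-alarm probabilities.
Pd = [[0.9, 0.5, 0.0], [0.0, 0.8, 0.7], [0.6, 0.0, 0.9]]
Pf = [[0.0] * 3 for _ in range(3)]
x_f, q_lb, q_min = mfd_subgradient([1, 2], [0], Pd, Pf, [0.1, 0.2, 0.05])
print(x_f, q_lb, q_min)
```

By Lemma 4, the returned `q_min` can never fall below the returned `q_lb`, so
their difference plays the role of the approximate duality gap in (18).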


B. Improving the Computational Complexity 

The computational complexity of the MFD algorithm for all
steps except the first step is $O(m|T_f|)$. It is well known
that the set-covering problem is NP-hard [11], and therefore,
the first step of the multiple fault diagnosis algorithm
limits the size of the problem that we can solve.

One of the important points here is that a near-optimal
solution as well as an upper bound solution for the
set-covering problem can be found via the Lagrangian
relaxation method [1] in a manner similar to the MFD
algorithm. Let $x^n(\lambda)$ and $x^u(\lambda)$ denote the near-optimal
(best feasible solution found) and upper bound solution
for the set-covering problem, respectively. Note that
any feasible solution for the set-covering problem is a
feasible solution for the multiple fault diagnosis problem.
However, for a given $\lambda$, the best feasible solution
for set-covering may not be the best feasible solution
for the multiple fault diagnosis problem. Therefore,
we have:

\[ J(x^n(\lambda), y(x^n(\lambda))) \le J^* \le Q(\lambda, x^*(\lambda), y^*(\lambda))
   \le Q(\lambda, x^u(\lambda), y^*(\lambda)) \]

Thus, using $x^u(\lambda)$ and $x^n(\lambda)$, we
can generate a sequence of upper and lower bounds to
the multiple fault diagnosis problem. In this case, the
multiple fault diagnosis algorithm should be modified as
follows: replace the optimal solution $x^*(\lambda)$ in the
algorithm with the near-optimal solution $x^n(\lambda)$, except in
$Q_{ub}$, where $x^*(\lambda)$ should be replaced by the upper bound
solution $x^u(\lambda)$. With this modification, the computational
complexity of this approach reduces to $O(m|T_f|)$, and
therefore it can be applied to large-scale systems. Note
that, because of the storage complexity of storing $Pd_{ij}$ and
$Pf_{ij}$ for all failure sources and tests, the available
memory of a given computer may limit the largest size of the
problem that we can solve.

In large-scale systems, it is practical to assume that
the detection and false alarm probabilities of each test
$t_j$ are the same for all failure sources connected to it, i.e.,
$Pd_{ij} = Pd_j$ and $Pf_{ij} = Pf_j$ if $e_{ij} \in E$; otherwise, $Pd_{ij} = 0$
and $Pf_{ij} = 0$. In this case, we define a binary reachability
matrix $R = \{r_{ij}\}$ such that $r_{ij} = 1$ if $e_{ij} \in E$; otherwise,
$r_{ij} = 0$. The detection and false alarm probabilities of
each test $t_j$ for each failure source $s_i$ can be evaluated
as follows: $Pd_{ij} = r_{ij} Pd_j$ and $Pf_{ij} = r_{ij} Pf_j$. Note
that, in this case, the binary matrix $R = \{r_{ij}\}$ can be
stored in a bit-compact format, and consequently, the
storage complexity of the problem reduces by a factor
of approximately $2K$, where $K$ is the number of bits for
representing a floating-point variable in a given computer. For
example, the storage requirement of the MFD problem
for a system with 10,000 failure sources and 10,000 tests when
$K = 32$ bits (or equivalently 4 bytes) is 800 Mbytes for
storing $Pd_{ij}$ and $Pf_{ij}$, and 12.5 Mbytes for storing the
binary matrix $R = \{r_{ij}\}$. By storing $Pd_j$, $Pf_j$
and $R = \{r_{ij}\}$, the total memory required reduces to
12.6 Mbytes.
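The storage figures quoted above follow from simple arithmetic, reproduced in
the sketch below. The function name is hypothetical, and 1 Mbyte is taken to be
$10^6$ bytes, which is the convention that matches the numbers in the text.

```python
def mfd_storage_bytes(m, n, bits_per_float=32):
    """Return (full, bitmap, compact) storage in bytes for m sources and n tests."""
    full = 2 * m * n * bits_per_float // 8           # Pd_ij and Pf_ij as floats
    bitmap = m * n // 8                              # reachability matrix R, 1 bit per entry
    compact = bitmap + 2 * n * bits_per_float // 8   # R plus per-test Pd_j and Pf_j
    return full, bitmap, compact

full, bitmap, compact = mfd_storage_bytes(10_000, 10_000)
print(full / 1e6, bitmap / 1e6, compact / 1e6)   # Mbytes
print(full // bitmap)                            # reduction factor 2K
```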

Despite the combinatorial nature of the multiple fault
problem, its optimal solution can be found via a
branch-and-bound algorithm. In the branch-and-bound algorithm:
(1) a binary tree is employed for the representation of the 0-1
combinations, (2) the feasible region is partitioned into
subdomains systematically, and (3) valid upper and lower
bounds are generated at different levels of the binary tree.
The main objective in a general branch-and-bound algorithm
is to perform an enumeration of the alternatives
without examining all 0-1 combinations of failure sources.
Details of branch-and-bound algorithms can be found in
any integer programming textbook, e.g., [8, 13, 17, 18].

IV. Ranked Set of Most Likely Candidates

In this section, we consider the problem of determining
a ranked set of solutions to the multiple fault diagnosis
problem. That is, the problem is to find the $L$ sets of most
likely candidates. We present the following sequential
approach to solve this problem:

Initialization: Find the first most likely candidate $X^1$
for the multiple fault diagnosis problem.

Algorithm:

DO for $l = 2, \ldots, L$, or until no feasible solution exists,
Eliminate the set of candidates $\{X^1, \ldots, X^{l-1}\}$
from the problem and generate the $l$-th most likely
candidate.

END

The first part of the algorithm, i.e., initialization, can
be solved by the algorithm of the previous section. In this
section, we present an approach to solve the second part
of the sequential algorithm. In this approach, we solve
a series of modified copies of the initial multiple fault
diagnosis problem.

A. Ranked Algorithm: Modified Copies of MFD Problem

In this approach, at each iteration, we solve a series
of multiple fault diagnosis problems assuming that the
states of some of the failure sources are known prior to
diagnosis, i.e., some failure sources are known good, and
some of them are known bad (definitely faulty). A similar
approach has been considered by Murty [12] for
determining a ranked set of solutions to assignment problems,
and was recently enhanced by Cox et al. [5] within the
context of multi-target tracking. For simplicity, we
represent the multiple fault diagnosis problem by the four-tuple
$\Gamma = (MFD, G, B, X)$, where

1. MFD is the problem in (10)-(14),

2. $G \subseteq S$ represents the set of known good failure
sources, i.e., for all $s_i \in G$, $x_i = 0$ (or $s_i \notin X$),

3. $B \subseteq S$ represents the set of definitely faulty failure
sources, i.e., for all $s_i \in B$, $x_i = 1$ (or $s_i \in X$),

4. $X$ is the optimal solution to the MFD problem
subject to $G$ and $B$.

Note that the number of unknown failure sources in
$\Gamma = (MFD, G, B, X)$ is $m - |G| - |B|$. Initially, $G$ and $B$
are empty, i.e., $\Gamma^1 = (MFD, \emptyset, \emptyset, X^1)$. Subsequent
solutions to $\Gamma^1$ are found by solving a succession of multiple
fault diagnosis problems that are created from $\Gamma^1$ by a
process called partitioning. A problem $\Gamma$, with the best
solution $X$ and size $m - |G| - |B|$, is partitioned into a
set of subproblems $\Gamma_1, \ldots, \Gamma_{m-|G|-|B|+1}$ such that:

• The union of the sets of possible solutions to $\Gamma_1$
through $\Gamma_{m-|G|-|B|+1}$ is exactly the set of possible
solutions to $\Gamma$,

• The sets of possible solutions to $\Gamma_1$ through
$\Gamma_{m-|G|-|B|+1}$ are disjoint, and

• $\Gamma_{m-|G|-|B|+1}$ has only one solution, $X$.

Let us assume that $\Gamma^r$ is a dummy subproblem
that is used to generate the subproblems $\Gamma_1$ through
$\Gamma_{m-|G|-|B|+1}$ from $\Gamma$. The following procedure shows:
(1) how to update the subproblem $\Gamma^r = (MFD, G^r, B^r, X^r)$,
(2) how to make subproblem $\Gamma_l = (MFD, G_l, B_l, X_l)$
from $\Gamma^r$ for $l = 1, \ldots, m - |G| - |B|$, sequentially,
and finally, (3) $\Gamma_{m-|G|-|B|+1} = \Gamma^r$. Initially,
$\Gamma^r = \Gamma$. Then, for $l = 1, \ldots, m - |G| - |B|$, $\Gamma^r$ is
partitioned as follows:

• Select any $s_i \in S - (G^r \cup B^r)$,

• If $s_i \in X$, then $G_l \leftarrow G^r \cup \{s_i\}$ and $B^r \leftarrow B^r \cup \{s_i\}$,
else $B_l \leftarrow B^r \cup \{s_i\}$ and $G^r \leftarrow G^r \cup \{s_i\}$.
In both cases, the set of $\Gamma_l$ that is not mentioned is
inherited from $\Gamma^r$ as it stood before the update.

Note that, at each iteration, the problem $\Gamma^r$ is partitioned
into two disjoint subproblems. This is because we force
the subproblems to differ in the status of only one
failure source $s_i$ in the system, i.e., we add $s_i$ to the
set of definitely faulty failure sources in one subproblem,
and to the set of known good failure sources in the other
subproblem. In addition, $X$ cannot be a solution to $\Gamma_l$
for $l = 1, \ldots, m - |G| - |B|$. Furthermore, $\Gamma^r$ is the only
subproblem which contains $X$ and only $X$ as its solution.
This is because $B^r = X$ and $G^r = S - X$.

As an illustration, let us consider a simple system with
3 failure sources $\{s_1, s_2, s_3\}$. In addition, let us assume
that the optimal solution for the MFD problem in this
case is $X = \{s_1\}$, i.e., $\Gamma^1 = (MFD, \emptyset, \emptyset, X^1 = \{s_1\})$.
Therefore, the MFD problem can be partitioned into the
following subproblems: $\Gamma_1 = (MFD, G = \{s_1\}, B = \emptyset, X_1)$;
$\Gamma_2 = (MFD, G = \emptyset, B = \{s_1, s_2\}, X_2)$;
$\Gamma_3 = (MFD, G = \{s_2\}, B = \{s_1, s_3\}, X_3)$; and
$\Gamma_4 = (MFD, G = \{s_2, s_3\}, B = \{s_1\}, X_4)$.
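The partitioning step can be sketched as a short Python function that
reproduces the three-source illustration. The function name and the use of
Python sets for $G$ and $B$ are choices made for this example; failure sources
are visited in the order given, which is one of the admissible selection orders.

```python
def partition(sources, good, bad, best):
    """Split problem (G, B) with best solution X into disjoint subproblems.

    Returns the list [(G_1, B_1), (G_2, B_2), ...] and the final dummy
    problem (G^r, B^r), whose only solution is X itself.
    """
    g_r, b_r = set(good), set(bad)
    subproblems = []
    for s in sources:
        if s in g_r or s in b_r:
            continue                                   # status of s is already fixed
        if s in best:
            subproblems.append((g_r | {s}, set(b_r)))  # force s good in the subproblem
            b_r.add(s)                                 # and faulty in the remainder
        else:
            subproblems.append((set(g_r), b_r | {s}))  # force s faulty in the subproblem
            g_r.add(s)                                 # and good in the remainder
    return subproblems, (g_r, b_r)

subs, last = partition(sources=[1, 2, 3], good=set(), bad=set(), best={1})
for g, b in subs:
    print("G =", sorted(g), "B =", sorted(b))
print("final:", sorted(last[0]), sorted(last[1]))
```

Each subproblem differs from the best solution in the status of exactly one
newly fixed failure source, which is why the subproblems are disjoint and none
of them admits the best solution again.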


Therefore, we partition $\Gamma^1$ according to its best
solution $X^1$, and place the resulting subproblems
together with their best solutions, except the last one, i.e.,
$\Gamma_{m-|G|-|B|+1}$, on a priority queue of four-tuples
(MFD, $G$, $B$, $X$). We then find the problem in the queue that
has the best solution. The solution of this problem is the
second-best solution to the multiple fault diagnosis
problem. Now, we remove this problem from the queue and
replace it by its partitioning. The best solution found in
the queue now is the third-best solution to the multiple
fault diagnosis problem, and so on. The pseudocode for
the L-ranked algorithm is shown in Figure 4.

Initialization: Find the first solution $X^1$ to the MFD
problem, and initialize a priority queue of four-tuple
problems to contain only $\Gamma^1 = (MFD, \emptyset, \emptyset, X^1)$. The top
problem on this queue will always be the problem with
the highest likelihood solution.

Step 1: Clear the list of solutions to be returned.

Step 2: DO until priority queue of problems is empty.

Step 2.1: Take the top problem
$\Gamma = (MFD, G, B, X)$ off the queue.

Step 2.2: Add $X$ to the list of solutions.

Step 2.3: If the cardinality of the solution set is $L$, Stop.

Step 2.4: Let $\Gamma^r = \Gamma$,

Step 2.5: DO for $l = 1, \ldots, m - |G| - |B|$,

Step 2.5.1: Partition $\Gamma^r$ into $\Gamma^r$ and $\Gamma'$ as follows:

Step 2.5.2: Select any $s_i \in S - (G^r \cup B^r)$,

Step 2.5.3: If $s_i \in X$, then
$G' \leftarrow G^r \cup \{s_i\}$ and $B^r \leftarrow B^r \cup \{s_i\}$,
else $B' \leftarrow B^r \cup \{s_i\}$ and $G^r \leftarrow G^r \cup \{s_i\}$.

Step 2.5.4: Find the best solution $X'$ to $\Gamma'$. If $X'$
exists, add (MFD, $G'$, $B'$, $X'$) to the queue.

END

END

Figure 4: Pseudocode for L-Rank MFD Algorithm 
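The queue-based procedure of Figure 4 can be sketched with Python's `heapq`
module. In this sketch each subproblem is solved exactly by brute-force
enumeration, which is an assumption made to keep the example short (the paper
uses the near-optimal MFD algorithm instead). The `score` callable, which maps a
candidate set to its log-likelihood, and all other names are hypothetical.

```python
import heapq
import itertools
import math

def solve(sources, good, bad, score):
    """Best candidate containing all of `bad` and none of `good` (brute force)."""
    free = [s for s in sources if s not in good and s not in bad]
    best, best_val = None, -math.inf
    for r in range(len(free) + 1):
        for extra in itertools.combinations(free, r):
            cand = frozenset(bad) | frozenset(extra)
            val = score(cand)
            if val > best_val:
                best, best_val = cand, val
    return best, best_val

def ranked_candidates(sources, score, L):
    """Return up to L (candidate, value) pairs in order of decreasing value."""
    x, val = solve(sources, set(), set(), score)
    if x is None:
        return []
    tie = itertools.count()              # tie-breaker so the heap never compares sets
    heap = [(-val, next(tie), frozenset(), frozenset(), x)]
    out = []
    while heap:
        neg, _, good, bad, x = heapq.heappop(heap)    # Step 2.1
        out.append((set(x), -neg))                    # Step 2.2
        if len(out) == L:                             # Step 2.3
            break
        g_r, b_r = set(good), set(bad)                # Step 2.4
        for s in sources:                             # Step 2.5
            if s in g_r or s in b_r:
                continue
            if s in x:
                g_l, b_l = g_r | {s}, set(b_r)
                b_r.add(s)
            else:
                g_l, b_l = set(g_r), b_r | {s}
                g_r.add(s)
            x_l, v_l = solve(sources, g_l, b_l, score)   # Step 2.5.4
            if x_l is not None:
                heapq.heappush(heap, (-v_l, next(tie), frozenset(g_l), frozenset(b_l), x_l))
    return out

# Hypothetical additive score, used only to exercise the ranking logic.
weights = {1: 0.5, 2: -1.0, 3: -2.0}
top3 = ranked_candidates([1, 2, 3], lambda c: sum(weights[s] for s in c), L=3)
print(top3)
```

With an exact subproblem solver the candidates come out in strictly
non-increasing order of score; with a near-optimal solver this ordering is not
guaranteed.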

Since each subproblem is NP-hard, we use the near-optimal
MFD algorithm of the previous section to solve
the ranked set problem near-optimally, i.e., $X$ is a
near-optimal solution for the problem $\Gamma = (MFD, G, B, X)$.
Thus, it is possible that the $l$-th solution, i.e., $X^l$, has higher
likelihood than the $k$-th solution, i.e., $X^k$, where $k < l$.
Note that we perform one partitioning for each of the $L$
best solutions; in the worst case, each partitioning creates
$O(m)$ new problems. This creates up to $O(Lm)$ multiple
fault problems and insertions on the priority queue.
Each problem takes at most $O(m|T_f|)$ time to solve
near-optimally, and each insertion takes at most $O(\log(Lm))$
time. Therefore, the worst-case execution time of this
approach is $O(Lm(m|T_f| + \log(Lm)))$, or approximately,
$O(Lm^2|T_f|)$.
V. Multiple Fault Diagnosis with Repetitive Tests

A reasonable and common situation in unreliable testing
is to apply a test several times to improve the confidence
in a given hypothesis (a set of multiple fault
candidates). For example, in order to reduce the probability
of error, i.e., false alarm and missed detection of
some faults (disorders or diseases), a system (a patient)
may be tested multiple times, and because of the imperfect
nature of tests, the test results may be different. In this
section, we assume that each test $t_j$ has been applied $n_j$
times, in which it passed and failed $\mu_j$ and $\eta_j$ times,
respectively, i.e., $n_j = \mu_j + \eta_j$. Note that applying a test
at different times is equivalent to applying independent
tests with the same structure. In this case, let us assume
that $T_f$ and $T_p$ denote the sets of failed and passed tests
(without any redundancy), respectively, and $T_f \cap T_p$ may
not be empty. Thus, the problem is:

\[ \max_{x, y} J(x, y) = \sum_{t_j \in T_f} \eta_j \ln(1 - y_j)
   + \sum_{i=1}^{m} x_i \Big\{ \sum_{t_k \in T_p} \mu_k \ln\Big( \frac{\overline{Pd}_{ik}}{\overline{Pf}_{ik}} \Big)
     + \ln(p_i) \Big\} \tag{20} \]

subject to (11)-(14). This problem is similar to the problem
in (10). Thus, the algorithms in the previous sections can
be readily applied to solve this problem. In this case: (1)
in the first step of the MFD algorithm, $c_i(\lambda)$ is a function
of $\mu_k$ for $k = 1, \ldots, |T_p|$, i.e., the number of times
that test $t_k$ passed, and (2) in the second step of the
MFD algorithm, the optimum of the objective function
with respect to $y$ is replaced by
$y_j^*(\lambda_j) = \frac{\lambda_j}{\eta_j + \lambda_j}\, u(\lambda_j)$ for
$j = 1, \ldots, |T_f|$.
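The modified maximizer follows from the same stationarity argument as in (16),
applied to the weighted function that (20) induces for each failed test. The
short derivation below assumes $\lambda_j > 0$, $\eta_j \ge 1$ and $0 < y_j < 1$.

```latex
f_j(\lambda_j, y_j) = \eta_j \ln(1 - y_j) + \lambda_j \ln(y_j),
\qquad
\frac{\partial f_j}{\partial y_j}
  = -\frac{\eta_j}{1 - y_j} + \frac{\lambda_j}{y_j} = 0
  \;\Longrightarrow\;
  y_j^*(\lambda_j) = \frac{\lambda_j}{\eta_j + \lambda_j} .
```

Setting $\eta_j = 1$ recovers the single-application result
$y_j^* = \lambda_j / (1 + \lambda_j)$.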

VI. Examples 

Example 1: In this example, we consider: (1) a simple
diagnostic problem with m = 20 failure sources (disorders)
and n = 20 tests (manifestations), which was used as an
example in [3] (Examples 1.a-1.d), and (2) a diagnostic
problem with m = 15 failure sources and n = 10 tests
from [9] (Example 1.e). The false alarm probabilities for
these systems are all zero, i.e., $Pf_{ij} = 0$ for $i = 1, \ldots, m$
and $j = 1, \ldots, n$, and $T_p = T - T_f$. Figures 5 and 6 show:
(1) the set of failed tests $T_f$, (2) diagnostic results, (3)
likelihood, (4) processing time and total number of runs
to converge to the diagnostic results, (5) total processing
time and total number of runs, and (6) approximate
duality gap. The diagnostic results are based on the
near-optimal multiple fault diagnosis algorithm in Figure 3.
The processing times for these examples are obtained by
running the MFD algorithm on a SPARC 10. Binglin
et al. [3] presented a genetic algorithm which required
10 minutes to find the set of diagnoses in Example 1.a
with an IBM PS/2 Model 40 SX 20 MHz microcomputer.
These results show the superior performance of our
algorithm compared to the algorithm in [3]. Miller et al.
[9] have not reported the processing time for Example
1.e. However, the largest problem that they considered
contained 20 failure sources and 15 tests.

Ex.    {j | t_j ∈ T_f}                    {i | s_i ∈ X}            Prob(X | T_f, T_p)
1.a    {1, 2, 4, 5, 7, 8, 13, 15}         {1, 9, 10, 14, 17}       3.66e-09
1.b    {7, 8, 9, 11, 14, 15}              {4, 5, 17, 20}           1.32e-10
1.c    {1, 3, 4, 6, 7, 11, 13, 15, 16}    {1, 5, 9, 14, 16, 17}    6.?2e-13
1.d    {1, 2, 3, 7, 8, 12, 13, 17}        {4, 5, ?, 14, 19}        2.49e-09
1.e    {1, 2, 4, 5, 7, 8, 9, 10}          {3, 4, 9, 12, 13}        7.77e-??

(? marks a character that is illegible in the source.)

Figure 5: MFD Algorithm Results for Examples 1.a-1.e


       Convergence             Total                   Approximate
Ex.    # Runs   Time (sec)     # Runs   Time (sec)     Duality Gap
1.a    8        0.170          58       0.310          4.68%
1.b    2        0.009          65       0.240          4.76%
1.c    2        0.050          68       0.340          4.6?%
1.d    1        0.004          64       1.83           4.69%
1.e    2        0.007          58       0.15           4.52%

Figure 6: MFD Algorithm Results for Examples 1.a-1.e

Example 2: In this example, we consider systems with:
(1) m=n=100, m=n=500 and m=n=1000, (2) the probability
of each failure source set to a random number
in (0.001, 0.5), (3) each test, on average, covering 5,
10 and 20 failure sources, (4) detection probabilities of
a test associated with its covered failure sources set
to random numbers in (0, 1), (5) the false alarm
probabilities assumed to be zero, and (6) the number
of failed tests equal to 5, 10 and 20. Figures 7, 8 and 9
show the simulation results for these systems. Each row
of these figures represents the average of simulation
results for 5 randomly generated systems. Note that, in
most of the cases, the average approximate duality gaps
are around 5%. However, in some of the cases, for
example, the last row of Figure 7, the approximate duality
gap is very large, i.e., 22.15%. In order to improve the
solution (or, equivalently, the approximate duality gap), we
can apply the L-ranked algorithm. The average
approximate duality gap based on the 2-ranked algorithm for the
last set of systems in Figure 7 reduces to 1.49%.

Example 3: In this example, we consider three systems
with 10 failure sources and 10 tests as in [15]. The false
alarm probabilities are assumed to be zero. The
simulation results for the $2^{10}$ possible combinations of test
results are shown in Figure 10. The second column shows
the number of correct cases out of 1024 possible
combinations of test results. The third column shows the
weighted probability of correct cases. The columns
corresponding to $N_d$ and $N_f$ denote the unweighted
probabilities of detection, i.e., the unweighted probability of
Average Test           Convergence             Total                   Approximate
Coverage      |T_f|    # Runs   Time (sec)     # Runs   Time (sec)     Duality Gap
5             5        2        0.10           62       2.70           4.25%
5             10       16       0.38           63       3.11           3.75%
5             20       4        0.83           70       13.78          4.79%
10            5        3        0.25           63       3.35           6.75%
10            10       12       2.91           60       12.00          5.26%
10            20       9        3.01           83       31.00          9.59%
20            5        1        0.11           55       6.97           6.12%
20            10       2        0.85           58       28.61          10.96%
20            20       16       23.39          59       90.72          22.15%

Figure 7: Simulation Results for m=n=100


Average Test           Convergence             Total                   Approximate
Coverage      |T_f|    # Runs   Time (sec)     # Runs   Time (sec)     Duality Gap
5             5        1        0.54           62       23.56          4.15%
5             10       3        1.62           72       44.58          3.96%
5             20       16       7.15           66       32.77          4.23%
10            5        1        0.48           58       26.99          4.82%
10            10       1        1.20           69       46.68          4.60%
10            20       15       19.25          67       85.37          6.60%
20            5        1        0.56           51       30.41          3.75%
20            10       6        6.55           64       60.38          3.00%
20            20       11       26.19          64       184.06         16.03%

Figure 8: Simulation Results for m=n=500


common faulty failure sources in the optimal and near- 
optimal solutions, and false alarm, i.e., the unweighted 
probability of faulty failure sources in the near-optimal 
solution and not in the optimal solution. Figure 11 shows 
the simulation results based on the 2-ranked algorithm. 
The average weighted (unweighted) accuracy based on 
the MFD algorithm and 2-ranked algorithm are 97.71% 
(94.99%) and 99.96%(99.77%), respectively. 
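The averages quoted above can be checked directly from the per-system entries
reported in Figures 10 and 11. The short Python sketch below does this
arithmetic; the variable names are chosen for this example, and the unweighted
accuracy is taken to be the number of correct cases divided by 1024.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

CASES = 1024  # 2**10 combinations of test results per system

# Correct cases and weighted accuracies (%) for Examples 3.a-3.c.
mfd_correct, mfd_weighted = [992, 971, 955], [99.91, 98.61, 94.61]
ranked_correct, ranked_weighted = [1024, 1019, 1022], [100.0, 99.92, 99.95]

mfd_unweighted = mean([100.0 * c / CASES for c in mfd_correct])
ranked_unweighted = mean([100.0 * c / CASES for c in ranked_correct])

print(round(mean(mfd_weighted), 2), round(mfd_unweighted, 2))
print(round(mean(ranked_weighted), 2), round(ranked_unweighted, 2))
```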

Example 4: In this example, we apply the MFD and 
L-ranked algorithms to the medical example in [14, 19]. 
The system under consideration is for neuropsychiatric 
diagnosis. The system consists of 26 disorders (failure 
sources) from psychiatry and neurology which affect men- 
tal status. A list of 56 symptoms (tests) and signs was 
assembled for each disorder. There are 384 links in the 
system, each of which connects a disorder to a manifes- 
tation. Similar to [14] , five groups of test cases are used 
to test the MFD and L-ranked algorithm. Manifesta- 
tions are chosen randomly from the total set of 56 possi- 


Average Test           Convergence             Total                   Approximate
Coverage      |T_f|    # Runs   Time (sec)     # Runs   Time (sec)     Duality Gap
5             5        2        3.39           67       102.09         5.89%
5             10       2        3.53           73       122.86         5.12%
5             20       2        5.02           67       138.56         4.59%
10            5        1        2.27           54       67.95          4.09%
10            10       2        3.63           55       99.53          4.27%
10            20       5        11.90          66       169.60         4.99%
20            5        1        3.26           53       103.29         4.25%
20            10       11       30.29          57       137.15         4.94%
20            20       28       139.14         76       374.54         9.42%

Figure 9: Simulation Results for m=n=1000


           Correct cases
Example    # Times (out of 1024)    Weighted    N_d       N_f
3.a        992 (96.88%)             99.91%      98.63%    0.39%
3.b        971 (94.82%)             98.61%      97.60%    0.31%
3.c        955 (93.26%)             94.61%      97.66%    0.37%

Figure 10: MFD Alg. Results for Examples 3.a-3.c


           Correct cases
Example    # Times (out of 1024)    Weighted    N_d       N_f
3.a        1024 (100%)              100%        100%      0.00%
3.b        1019 (99.51%)            99.92%      99.77%    0.06%
3.c        1022 (99.80%)            99.95%      99.92%    0.03%

Figure 11: 2-ranked Alg. Results for Examples 3.a-3.c


ble manifestations based on a uniform distribution. Each 
group of test cases consists of ten different sets of
manifestations. Each case in the first test group has one present
manifestation (failed test); the cases in the other groups
have 3, 5, 7 and 9 manifestations, respectively. If any randomly
generated test result is inconsistent with the causal network,
the case is discarded and a new one is generated.
Inconsistent test results may occur because the causal
network used in the experiment has some perfect tests,
i.e., $Pd_{ij} = 1$ and $Pf_{ij} = 0$. Thus, after applying
Lemma 1 and reducing the size of the problem,
Lemma 2 may not be satisfied, i.e., there exists a failed
test that is not covered by any failure source. Simulation
results show that, among all 50 cases, the MFD algorithm
and the 2-ranked algorithm generate 98% and 100% optimal
solutions, respectively. Peng and Reggia applied their
competition-based connectionist methods to this causal
network. Their algorithm generated 74% of globally
optimal solutions, and 90% of one of the three globally
optimal solutions.

VII. Conclusion 

In this paper, we considered the problem of constructing
optimal and near-optimal multiple fault diagnosis
in bipartite systems with unreliable (imperfect) tests.
We presented a multiple fault diagnosis algorithm based
on Lagrangian relaxation and subgradient optimization,
which provides near-optimal solutions for the
multiple fault diagnosis problem, and upper bounds for an
optimal branch-and-bound algorithm. Computational results
indicate that our algorithm can be used in systems with
as many as 1000 faults. In addition, we presented an
algorithm to generate the set of L-ranked multiple fault
candidates. In this algorithm, we find the most likely
candidate using the near-optimal multiple fault diagnosis
algorithm. Then, we partition the problem, based on
the first solution, into a set of disjoint subproblems. The
subproblem solution with the highest likelihood
represents the second most likely candidate. This
procedure is continued until L-ranked multiple fault
diagnoses are found, or no more feasible solutions exist.
We showed that the computational complexity of this
approach is $O(Lm^2|T_f|)$, and that it is therefore applicable to
systems with as many as 1000 faults and tests. Finally,
we extended the multiple fault diagnosis problem to
redundant or repetitive tests. In this case, the problem is
very similar to the original multiple fault diagnosis
problem, and therefore, the MFD algorithm can be extended
to this problem as well.

In this paper, we assumed that the test results are 
known prior to diagnosis. That is, we considered the 
problem of multiple fault diagnosis with unreliable tests. 
The problem of sequential multiple fault diagnosis strategy
(testing) with unreliable tests is an important problem
in field maintenance. Furthermore, the order of partitioning
in the L-ranked algorithm may improve the accuracy
of the near-optimal solutions. We expect to investigate
these challenging issues in our future efforts.

Abstract

In this paper, we consider the problem of constructing optimal and near-optimal multiple 
fault diagnosis (MFD) in bipartite systems with unreliable (imperfect) tests. It is known that 
exact computation of conditional probabilities for multiple fault diagnosis is NP-hard. The 
novel feature of our diagnostic algorithms is the use of Lagrangian relaxation and subgradient 
optimization methods to provide: (1) near optimal solutions for the MFD problem, and (2) 
upper bounds for an optimal branch-and-bound algorithm. The proposed method is illustrated 
using several examples. Computational results indicate that: (1) our algorithm has superior 
computational performance to the existing algorithms (approximately three orders of magnitude 
improvement over the algorithm in [3]), (2) the near optimal algorithm generates the most likely 
candidates with a very high accuracy, and (3) our algorithm can find the most likely candidates 
in systems with as many as 1000 faults.

I. Introduction

With the increased recognition of the importance of design for testability, there is an increasing
trend towards the use of smart sensors for on-board system health management. The results of
on-board tests are available to the ground test systems and operators as a block of symptoms. 
Due to improper set up, operator error, electromagnetic interference, environmental conditions, or 
aliasing inherent in the signature analysis of on-board tests, the nature of tests may be unreliable 
(imperfect). Imperfect tests introduce an additional element of uncertainty into the diagnostic 
process: the pass outcome of a test does not guarantee the integrity of components under test 
(because the test may have missed a fault), or a failed test outcome does not mean that one or more 
of the implicated components are faulty (because the test outcome may have been a false alarm). 
Consequently, the diagnostic procedures must hedge against this uncertainty in test outcomes. 

In this paper, we consider the problem of constructing optimal and near-optimal multiple fault 
diagnosis in bipartite digraphs with unreliable tests. This problem is a central and long-standing 
concern in system fault diagnosis, and medical decision making [10]. When the false alarm probabil- 
ities of all tests are zero, the problem simplifies to the parsimonious covering theory (or probabilistic 
causal model) discussed in [16]. Peng and Reggia [15] proposed a competition based connectionist 
method to subdue the problem of combinatorial explosion in computing the posterior probabilities 
of all possible combinations of failure sources in probabilistic causal models. However, this method 
does not guarantee a global optimum and suffers from large computation times even for problems 
with small numbers of failure sources, m=26. 

Genetic algorithms have been offered as an alternative to the connectionist methods [3, 9]. Genetic
algorithms are based on an analogy with Darwin's theory of biological evolution, in which a group of
solutions evolves via natural selection. They emulate the rules of the evolutionary process, such
as reproduction, crossover, mutation, and natural selection. At each iteration, a population of
individuals is established, where each individual corresponds to a point in the search space. The
objective function is evaluated for each individual to rate its fitness. Then, the next generation is
formed based on the survival of the fittest. Therefore, the evolution of individuals from generation to
generation tends to result in fitter individuals (i.e., solutions) in the search space. These algorithms
converge extremely slowly, and have been applied to small problems with m=20 failure sources
(causes, disorders) and n=20 tests (manifestations, symptoms).

Wu [20] proposed a decomposition method based on common and disjoint causal (failure source) 
relationships among the given symptoms (tests). This method decomposes the original problem 
into smaller and independent subproblems, and therefore, increases the performance and efficiency 
of multiple fault diagnosis. However, this approach is not applicable to systems with large numbers
of nondecomposable causes and symptoms.

In this paper, we present a novel approach, using Lagrangian relaxation, to solve the multiple fault
diagnosis problem. By defining new variables and constraints, the multiple fault diagnosis (MFD)
problem reduces to a combinatorial optimization problem with a set of equality constraints. The
constraints are relaxed via Lagrange multipliers. The relaxation procedure generates an upper
bound for the objective function. The procedure of minimizing the upper bound via a subgradient
optimization produces a sequence of solutions that are modified, in a computationally effective
way, to produce a sequence of feasible solutions to the MFD problem. If the objective function
value for the best feasible solution and the upper bound are the same, the feasible solution is the
optimal solution. Otherwise, the difference between the upper bound and the feasible solution,
termed the approximate duality gap, provides a measure of suboptimality of the MFD solution.
Alternatively, the optimal solution can be found via a tree search (or branch-and-bound) procedure.
The computational complexity of the near-optimal algorithm is a linear function of the number of
failure sources, m, and the number of failed tests, $|T_f|$.

Next, we present an approach to determine a ranked set of multiple fault diagnosis solutions
(i.e., the best, second best, ..., L-th best diagnosis). In this approach, following Murty [12] and
Cox et al. [5], we: (1) partition the MFD problem, based on its best solution, into disjoint
subproblems; (2) solve the subproblems and sort them by the values of their solutions; and (3)
select the subsequent best solutions. One of the advantages of this approach, compared to the
one in [14], is that since the subproblems are disjoint, the optimal solution of each subproblem is
different from the others. Finally, we show that the MFD algorithm can be extended to solve
multiple fault diagnosis problems with repetitive application of tests.

The paper is organized as follows. In Section II, we formulate the multiple fault diagnosis prob- 
lem in a bipartite system. In Section III, we present a near-optimal algorithm based on Lagrangian 
relaxation and subgradient optimization method to diagnose multiple faults, and generate an upper 
bound for the likelihood of multiple fault candidates. The upper bound can be used in an optimal 
branch-and-bound algorithm. The multiple fault algorithms for a set of L-ranked multiple fault 
diagnoses are presented in Section IV. In Section V, we consider the multiple fault diagnosis prob- 
lem with repetitive tests. Several examples are presented in Section VI. Finally, in Section VII, we 
summarize the results and discuss future research issues. 

II. Problem Formulation

The MFD problem in bipartite systems with imperfect tests consists of a bipartite digraph
$DG = \{S, T, E\}$, where

• $S = \{s_1, s_2, \ldots, s_m\}$ is a finite set of independent failure sources (failure nodes) associated with
the system;

• $T = \{t_1, t_2, \ldots, t_n\}$ is a finite set of n available binary outcome tests (test nodes), where the
integrity of system failure sources/components/modules can be ascertained;

• $E = \{e_{ij}\}$ is the set of digraph edges (links) specifying the functional information flow between
the set of failure sources and the set of tests in the system.

Figure 1: Detection-False-Alarm Probability of Failure Source $s_i$ and Test $t_j$

The input requirements of the failure nodes and edges of the digraph are as follows:

1. Failure node: A priori probability vector of failure nodes $P = [p(s_1), \ldots, p(s_m)]$, where $p(s_i) > 0$
is the a priori probability of failure source $s_i$.

2. Link (edge): A set of probability pairs $P_{ij} = (Pd_{ij}, Pf_{ij})$ representing the detection-false-alarm
probabilities of the set of tests, where $Pd_{ij}$ and $Pf_{ij}$ are the detection and false alarm
probabilities of test $t_j$ and failure source $s_i$, respectively (see Figure 1). Figure 2 shows a
bipartite digraph model.

The problem is to find the most likely candidates $X \subseteq S$ that are consistent with the results of
applied tests. This is formulated as:

$$\max_{X \subseteq S} \mathrm{Prob}(X \mid T_p, T_f) \tag{1}$$

where $T_p \subseteq T$ and $T_f \subseteq T$ denote the set of passed and failed tests, respectively. Using Bayes' rule
and eliminating the constant factor $\mathrm{Prob}(T_p, T_f)$, we obtain the following equivalent maximization
problem:

$$\max_{X \subseteq S} \mathrm{Prob}(T_f, T_p \mid X)\,\mathrm{Prob}(X) \tag{2}$$

Figure 2: Illustration of the Bipartite Digraph Model

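For notational simplicity, we define a binary vector $x$ of size m, where $x_i = 1$ if failure source
$s_i \in X$; $x_i = 0$, otherwise. Note that, given a multiple fault candidate $X$, the tests are independent.
Thus, the above probabilities can be evaluated as follows:

$$\mathrm{Prob}(T_p \mid X) = \prod_{t_k \in T_p} \mathrm{Prob}(O(t_k) = p \mid X) \tag{3}$$

$$\mathrm{Prob}(T_f \mid X) = \prod_{t_j \in T_f} \mathrm{Prob}(O(t_j) = f \mid X) \tag{4}$$

$$\mathrm{Prob}(O(t_j) = p \mid X) = \prod_{i=1}^{m} (1 - Pd_{ij})^{x_i} (1 - Pf_{ij})^{(1 - x_i)} = \prod_{i=1}^{m} (1 - Pf_{ij}) \prod_{i=1}^{m} \left( \frac{1 - Pd_{ij}}{1 - Pf_{ij}} \right)^{x_i} \tag{5}$$

$$\mathrm{Prob}(O(t_j) = f \mid X) = 1 - \mathrm{Prob}(O(t_j) = p \mid X) \tag{6}$$

$$\mathrm{Prob}(X) = \prod_{i=1}^{m} p(s_i)^{x_i} (1 - p(s_i))^{(1 - x_i)} = \left[ \prod_{i=1}^{m} (1 - p(s_i)) \right] \left[ \prod_{i=1}^{m} \left( \frac{p(s_i)}{1 - p(s_i)} \right)^{x_i} \right] \tag{7}$$

where $O(t_j) \in \{p\,(=\text{pass}),\, f\,(=\text{fail})\}$ is the outcome of test $t_j$, and $Pd_{ij} = 0$ and $Pf_{ij} = 0$ for
$e_{ij} \notin E$.

III. Problem Solution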

One approach for generating the optimal multiple fault diagnosis is to consider all possible
combinations of failure sources, i.e., the power set $2^S$, and select the multiple fault candidate with
the highest likelihood function in (1). However, the computational complexity of this approach is
exponential in the number of failure sources m. In the following, we present an algorithm, based on
Lagrangian relaxation and subgradient optimization method, to generate a near-optimal solution
for this problem.
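The exhaustive approach can be sketched in a few lines for very small systems. The following Python sketch evaluates the logarithm of the objective in (2) using (3)-(7) and enumerates the power set. The function names, the 0-based indexing and the three-source, two-test data are illustrative assumptions, not taken from the paper.

```python
import math
from itertools import product

def log_likelihood(x, failed, passed, prior, pd, pf):
    """ln[Prob(T_f, T_p | X) Prob(X)] built from equations (3)-(7).

    x: tuple of 0/1 flags per failure source; failed/passed: test indices;
    prior[i] = p(s_i); pd[i][j], pf[i][j] = detection / false-alarm probabilities.
    """
    m = len(x)

    def prob_pass(j):
        # Equation (5): product of (1 - Pd_ij)^x_i (1 - Pf_ij)^(1 - x_i).
        p = 1.0
        for i in range(m):
            p *= (1.0 - pd[i][j]) if x[i] else (1.0 - pf[i][j])
        return p

    total = 0.0
    for i in range(m):                      # equation (7)
        total += math.log(prior[i]) if x[i] else math.log(1.0 - prior[i])
    for j in passed:                        # equation (3)
        p = prob_pass(j)
        if p <= 0.0:
            return -math.inf
        total += math.log(p)
    for j in failed:                        # equations (4) and (6)
        q = 1.0 - prob_pass(j)
        if q <= 0.0:
            return -math.inf
        total += math.log(q)
    return total

def brute_force_mfd(failed, passed, prior, pd, pf):
    """Enumerate all 2^m candidates; only practical for small m."""
    m = len(prior)
    return max(product((0, 1), repeat=m),
               key=lambda x: log_likelihood(x, failed, passed, prior, pd, pf))

# Hypothetical data: 3 failure sources, 2 tests, no false alarms.
prior = [0.1, 0.1, 0.01]
pd = [[0.9, 0.0], [0.0, 0.8], [0.7, 0.7]]
pf = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
best = brute_force_mfd(failed=[0], passed=[1], prior=prior, pd=pd, pf=pf)
```

The loop over `product((0, 1), repeat=m)` visits $2^m$ candidates, which is the exponential cost that motivates the relaxation developed next.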

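By substituting (3) and (4) into (2) and taking the natural logarithm of the resulting objective
function, the problem is equivalent to:

$$\max_{X \subseteq S} \; \sum_{t_j \in T_f} \ln(\mathrm{Prob}(O(t_j) = f \mid X)) + \sum_{t_k \in T_p} \ln(\mathrm{Prob}(O(t_k) = p \mid X)) + \ln(\mathrm{Prob}(X)) \tag{8}$$

By substituting (5), (6) and (7) into (8), the problem reduces to:

$$\max_{x} \; \sum_{t_j \in T_f} \ln\left( 1 - \prod_{i=1}^{m} \left( \frac{\overline{Pd}_{ij}}{\overline{Pf}_{ij}} \right)^{x_i} \prod_{i=1}^{m} \overline{Pf}_{ij} \right) + \sum_{t_k \in T_p} \left[ \sum_{i=1}^{m} x_i \ln\left( \frac{\overline{Pd}_{ik}}{\overline{Pf}_{ik}} \right) + \sum_{i=1}^{m} \ln(\overline{Pf}_{ik}) \right] + \sum_{i=1}^{m} \{ x_i \ln(p_i) + \ln(1 - p(s_i)) \} \tag{9}$$

where $p_i = p(s_i)/(1 - p(s_i))$, $\overline{Pf}_{ij} = 1 - Pf_{ij}$ and $\overline{Pd}_{ij} = 1 - Pd_{ij}$ for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. By
(i) eliminating the constant terms $\sum_{i=1}^{m} \ln(1 - p(s_i))$ and $\sum_{t_k \in T_p} \sum_{i=1}^{m} \ln(\overline{Pf}_{ik})$, and (ii) defining new
variables $y_j = \left[ \prod_{i=1}^{m} (\overline{Pd}_{ij}/\overline{Pf}_{ij})^{x_i} \right] \left[ \prod_{i=1}^{m} \overline{Pf}_{ij} \right]$ for $t_j \in T_f$, and taking the natural logarithm of it, the
problem reduces to the following optimization problem:

$$\max_{x, y} J(x, y) = \sum_{t_j \in T_f} \ln(1 - y_j) + \sum_{i=1}^{m} x_i \left\{ \sum_{t_k \in T_p} \ln\left( \frac{\overline{Pd}_{ik}}{\overline{Pf}_{ik}} \right) + \ln(p_i) \right\} \tag{10}$$

$$\text{subject to:} \quad \ln(y_j) = \sum_{i=1}^{m} x_i \ln\left( \frac{\overline{Pd}_{ij}}{\overline{Pf}_{ij}} \right) + \sum_{i=1}^{m} \ln(\overline{Pf}_{ij}) \tag{11}$$

$$0 \le y_j \le 1 \quad \text{for } t_j \in T_f \tag{12}$$

$$x_i = 0 \text{ or } 1 \quad \text{for } i = 1, \ldots, m \tag{13}$$

where $y = [y_1, \ldots, y_{|T_f|}]$, and $|\cdot|$ denotes set cardinality. For simplicity, we define $h_j = \sum_{i=1}^{m} \ln(\overline{Pf}_{ij})$
for $t_j \in T_f$. The following lemmas present two important properties of the MFD problem.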

Lemma 1: If $Pd_{ik} = 1$ for any passed test $t_k$, then the optimal solution does not contain failure
source $s_i$, i.e., $x_i = 0$ (or equivalently $s_i \notin X$).

Proof: If $s_i \in X$ (or $x_i = 1$), then the second part of the objective function in (10), and,
consequently, the overall objective function will be unbounded, i.e., it would be $-\infty$.

Using Lemma 1, the size of the MFD problem can be reduced by removing all failure sources
$\{s_i \mid Pf_{ik} = 0,\ Pd_{ik} = 1 \text{ and } t_k \in T_p\}$ from the problem.

Lemma 2: If the false alarm probabilities of a failed test $t_j$ are zero, i.e., $Pf_{ij} = 0$ for $i = 1, \ldots, m$,
then the optimal solution contains at least one $x_i = 1$ for which $Pd_{ij} > 0$. That is, the optimal
solution must cover the failed tests.

Proof: We prove this lemma by contradiction. $Pf_{ij} = 0$ for $i = 1, \ldots, m$ results in $h_j = 0$. If $x_i = 0$ for all
$i$ with $Pd_{ij} > 0$, then we have $\ln(y_j) = 0$ and, hence, $y_j = 1$. Thus, $\ln(1 - y_j)$, the first part of the
objective function in (10), and, consequently, the overall objective function will be unbounded.
Using Lemma 2, we define the following constraints:

$$A x \ge e \quad \text{for } t_j \in T_f \text{ and } h_j = 0 \tag{14}$$

where $A = \{a_{li}\}$ is a binary matrix of size $|H| \times m$; $H = \{t_j \in T_f \mid h_j = 0 \text{ for } j = 1, \ldots, n\}$; each
row $l$ of matrix $A$ corresponds to a failed test $t_j$ with $h_j = 0$; $a_{li} = 1$ if $Pd_{ij} > 0$ for $i = 1, \ldots, m$;
otherwise, $a_{li} = 0$, and $e$ is a vector of 1's.
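As an illustration of how Lemmas 1 and 2 might be applied as a preprocessing step, the following Python sketch removes the failure sources excluded by Lemma 1 and builds the rows of the covering matrix in (14). The function name, the 0-based indexing, the data, and the assumption that every false-alarm probability is strictly below 1 are illustrative and not taken from the paper.

```python
def reduce_and_cover(failed, passed, pd, pf):
    """Apply Lemma 1, then build the covering rows of constraint (14).

    pd[i][j], pf[i][j]: detection / false-alarm probabilities (assumes pf < 1).
    Returns (keep, rows): indices of retained sources, and one 0/1 row per
    failed test whose false-alarm probabilities are all zero (h_j = 0).
    """
    m = len(pd)
    # Lemma 1: a source detected with certainty by a passed test cannot be faulty.
    keep = [i for i in range(m) if all(pd[i][k] < 1.0 for k in passed)]
    rows = []
    for j in failed:
        # h_j = sum_i ln(1 - Pf_ij) is zero exactly when every Pf_ij is zero.
        if all(pf[i][j] == 0.0 for i in range(m)):
            # Lemma 2: at least one retained source with Pd_ij > 0 must be faulty.
            rows.append([1 if pd[i][j] > 0.0 else 0 for i in keep])
    return keep, rows

# Hypothetical data: 3 failure sources, 2 tests, no false alarms.
pd = [[1.0, 0.9], [0.0, 0.8], [0.5, 0.0]]
pf = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
keep, rows = reduce_and_cover(failed=[1], passed=[0], pd=pd, pf=pf)
```

In this example the first source is dropped because test 0 passed and detects it with certainty, and the single covering row states that the second source must explain the failure of test 1.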


Adding the set of constraints (14) to the problem in (10)-(13) results in a smaller search space
and tighter upper and lower bounds (which result in faster convergence), and, therefore, a better
estimate of the optimal solution.

Lemma 3: When all tests are perfect, that is, $Pd_{ij} = 1$ and $Pf_{ij} = 0$ for $i = 1, \ldots, m$, $j = 1, \ldots, n$
and $e_{ij} \in E$, using Lemmas 1 and 2, the problem reduces to the following set-covering problem:
$\max_x \sum_{s_i \in S^-} x_i \ln(p_i)$ subject to (13) and (14), where $S^-$ is the reduced set of failure sources, i.e.,
$S$ after eliminating the failure sources satisfying Lemma 1.

Proof: This lemma follows directly from Lemmas 1 and 2. $Pd_{ij} = 1$ and $Pf_{ij} = 0$ for the
failed tests results in $h_j = y_j = 0$. Therefore, the first part as well as the second part (using Lemma
1) of the objective function in (10) can be eliminated, and the problem reduces to the traditional
set-covering problem. The set-covering problem can be solved optimally by any optimal set-covering
algorithm [2, 7], or near-optimally via a Lagrangian relaxation and subgradient optimization method
[1].

By relaxing the constraints in (11) via Lagrange multipliers $\{\lambda_j\}$, we obtain the Lagrangian
function:

$$\begin{aligned} \max_{x, y} Q(\lambda, x, y) &= \sum_{t_j \in T_f} \{ \ln(1 - y_j) + \lambda_j \ln(y_j) \} + \sum_{i=1}^{m} x_i \left\{ \sum_{t_k \in T_p} \ln\left( \frac{\overline{Pd}_{ik}}{\overline{Pf}_{ik}} \right) + \ln(p_i) - \sum_{t_j \in T_f} \lambda_j \ln\left( \frac{\overline{Pd}_{ij}}{\overline{Pf}_{ij}} \right) \right\} - \sum_{t_j \in T_f} \lambda_j h_j \\ &= \sum_{t_j \in T_f} f_j(\lambda_j, y_j) + \sum_{i=1}^{m} c_i(\lambda) x_i - \sum_{t_j \in T_f} \lambda_j h_j \end{aligned} \tag{15}$$

subject to (12), (13) and (14), where $f_j(\lambda_j, y_j)$ and $c_i(\lambda)$ denote the first and second expressions
in the braces in (15), respectively. The important point here is that the maximization of the
Lagrangian function in (15) with respect to $x$ and $y$ can be carried out independently for each fixed $\lambda$.
Maximization of $Q(\lambda, x, y)$ with respect to $y$ is equivalent to:

$$\max_{y_j} f_j(\lambda_j, y_j) = \ln(1 - y_j) + \lambda_j \ln(y_j) \quad \text{for } t_j \in T_f \tag{16}$$

The maximum of this function is attained at $y_j^*(\lambda_j) = \frac{\lambda_j}{1 + \lambda_j} u(\lambda_j)$. At the value $y_j^*(\lambda_j)$, the first and
second derivatives of the function are zero and negative, respectively, indicating that $f_j(\lambda_j, y_j)$ attains a
maximum there (where $u(\cdot)$ is the unit step function).
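The closed form for $y_j^*$ can be checked numerically. The short Python sketch below compares $\lambda_j/(1+\lambda_j)$ against a grid search over $f_j$ in (16) and evaluates the first derivative at the maximizer. The function names and the grid resolution are illustrative choices.

```python
import math

def f(lam, y):
    """f_j(lambda_j, y_j) = ln(1 - y_j) + lambda_j ln(y_j), for 0 < y_j < 1."""
    return math.log(1.0 - y) + lam * math.log(y)

def y_star(lam):
    """Maximizer of f_j: lambda / (1 + lambda) for lambda > 0, else 0 (unit step)."""
    return lam / (1.0 + lam) if lam > 0 else 0.0

# Grid search over the open interval (0, 1) for lambda = 1.
grid = [k / 1000.0 for k in range(1, 1000)]
best_on_grid = max(grid, key=lambda y: f(1.0, y))

# First derivative of f_j at the maximizer for lambda = 3: -1/(1 - y) + lambda / y.
slope_at_optimum = -1.0 / (1.0 - y_star(3.0)) + 3.0 / y_star(3.0)
```

For $\lambda_j = 1$ the maximizer is 0.5, which is the value referred to when the multipliers are initialized to 1 in the MFD algorithm.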

The maximization of the Lagrangian function $Q(\lambda, x, y)$ with respect to $x$ is equivalent to:

$$\max_{x} W(\lambda, x) = \sum_{i=1}^{m} c_i(\lambda) x_i \tag{17}$$

subject to (13) and (14), which is a traditional set-covering problem. This problem has been
extensively studied by the operations research and management science communities [2, 7]. There
exist a number of optimal algorithms, based on feasible solution exclusion constraints, Gomory
f-cuts and tree-search procedures for this problem [2, 7]. Let $x^*(\lambda)$ be the optimal solution of this
set-covering problem. Thus, $Q(\lambda, x^*(\lambda), y^*(\lambda))$ is an upper bound for the optimal objective value
in (10). This result is summarized in the following Lemma:

Lemma 4: Let $J^*$ be the optimal value of the objective function in (10). Then $Q(\lambda, x^*(\lambda),
y^*(\lambda)) \ge J^*$ for any $\lambda$.

Proof: Let $x^o$ and $y^o$ be the optimal solution of the problem in (10). Thus, $Q(\lambda, x^o, y^o) \le
Q(\lambda, x^*(\lambda), y^*(\lambda))$. This is because $x^*(\lambda)$ and $y^*(\lambda)$ are optimal with respect to the relaxed problem
in (15). Since the optimal solutions $x^o$ and $y^o$ satisfy (11), we have $Q(\lambda, x^o, y^o) = J^*$, and therefore,
$J^* \le Q(\lambda, x^*(\lambda), y^*(\lambda))$.

After evaluating the optimum values $x^*(\lambda)$ and $y^*(\lambda)$ for a fixed $\lambda$, the problem reduces to one
of minimizing the upper bound $Q(\lambda) = Q(\lambda, x^*(\lambda), y^*(\lambda))$. Since $Q(\lambda)$ is a piecewise differentiable
function of $\lambda$, this problem cannot be solved using differentiable optimization algorithms. As an
alternative, we use a subgradient optimization algorithm [13] to produce a sequence of upper bounds
for $Q(\lambda)$.

If we denote by $Q^*$ the optimal Lagrangian function value, i.e., $Q^* = Q(\lambda^*) = \min_{\lambda} Q(\lambda)$, the
difference $(Q^* - J^*)$ is termed the exact duality gap. Since the problem in (10)-(14) is NP-hard [4],
we may never know the global optimal solution $J^*$. Instead, we construct several feasible solutions
to this problem from the Lagrangian function solution, and select the best feasible solution from
the set. Let $J(\lambda^*, x^f, y^f)$ be the best feasible value; then we have $J(\lambda^*, x^f, y^f) \le J^* \le Q^*$. A nice
feature of the Lagrangian relaxation method is that the approximate duality gap:

$$Q^* - J(\lambda^*, x^f, y^f) = (Q^* - J^*) + (J^* - J(\lambda^*, x^f, y^f)) \ge 0 \tag{18}$$

provides an overestimate (by the value of the exact duality gap, $(Q^* - J^*)$) of the error between
the global optimal solution and the best feasible solution found. Thus, in some cases, even though
the best feasible solution found is the optimum solution of the problem, the approximate duality
gap may be nonzero; see Example 1 in Section VI. Based on extensive computational experiments,
the relative approximate duality gap, $\delta J$, defined by:

$$\delta J = \frac{Q^* - J(\lambda^*, x^f, y^f)}{J(\lambda^*, x^f, y^f)} \tag{19}$$

is small for the multiple fault diagnosis algorithms (typically less than 5%). The pseudocode of the
multiple fault diagnosis algorithm is presented in the next section.


A. Multiple Fault Diagnosis Algorithm

Let $(x^f, y^f)$, $Q_{min}$, $Q_{ub}$ and $Q_{lb}$ be the best feasible solution found, minimum upper bound,
current upper bound and maximum lower bound (function value based on the best feasible solution
found, i.e., $J(x^f, y^f)$) for $Q(\lambda, x, y)$, respectively. The pseudocode of the multiple fault diagnosis
algorithm is shown in Figure 3.

Initialization: Initialize: (1) $\lambda_j = 1$ for $j = 1, \ldots, |T_f|$, (2) $Q_{min} = \infty$, (3) $Q_{lb} = -\infty$, and (4)
set iteration count $t = 1$. The reason for initializing $\lambda_j = 1$ is that it results in $y_j^* = 0.5$.

Step 1: Find optimum values $x^*(\lambda)$ by solving the set-covering problem in (17).

Step 2: Find optimum values $y^*(\lambda)$ where $y_j^*(\lambda_j) = \frac{\lambda_j}{1 + \lambda_j} u(\lambda_j)$ for $j = 1, \ldots, |T_f|$.

Step 3: Evaluate $y(x^*(\lambda))$ using equation (11).

Step 4: Update $x^f$, $y^f$, $Q_{min}$, $Q_{ub}$ and $Q_{lb}$ as follows:

• If $J(x^*(\lambda), y(x^*(\lambda))) > Q_{lb}$, then $x^f = x^*(\lambda)$, $y^f = y(x^*(\lambda))$,
and $Q_{lb} = J(x^*(\lambda), y(x^*(\lambda)))$,

• $Q_{ub} = Q(\lambda, x^*(\lambda), y^*(\lambda))$,

• $Q_{min} = \min(Q_{min}, Q_{ub})$.

Step 5: Calculate the subgradient $d_j = \ln(y_j^*) - \{ \sum_{i=1}^{m} x_i^*(\lambda) \ln(\overline{Pd}_{ij}/\overline{Pf}_{ij}) + h_j \}$ for $j = 1, \ldots, |T_f|$.

Step 6: Stop if $\sum_{j=1}^{|T_f|} d_j^2 = 0$ since in this case we cannot define a suitable step size.

Step 7: Define a step size $\beta$ that is proportional to a factor $f$ and to the gap between $Q_{ub}$ and $Q_{lb}$
(weighted by a parameter $\alpha$), and inversely proportional to $\sum_{j=1}^{|T_f|} d_j^2$, where initially $f = 2$. If $Q_{min}$ has not
decreased in the last 10 iterations of the subgradient procedure with the current value
of $f$, then $f$ is halved. This approach to deciding the value of $f$ is based on the
procedure of Fisher [6]. The parameter $\alpha$ with typical value $1 < \alpha < 1.1$ is to ensure
that $\beta$ does not become too small as the gap between $Q_{ub}$ and $Q_{lb}$ decreases [1].

Step 8: Stop if $f < 0.05$ or $t > 100$ (or any other suitable stopping criteria).

Step 9: Update the Lagrange multipliers $\lambda_j$ as follows: $\lambda_j \leftarrow \max(0, \lambda_j + \beta d_j)$ for $t_j \in T_f$,
$t \leftarrow t + 1$, and go to step 1.

Figure 3: Pseudocode of MFD Algorithm
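The structure of the iteration in Figure 3 can be sketched in Python for a small dense example. This is a simplified sketch under stated assumptions, not a transcription of the authors' procedure: (i) all $Pd_{ij}$ and $Pf_{ij}$ are assumed to lie strictly between 0 and 1, so the covering constraints (14) are empty and Step 1 reduces to setting $x_i = 1$ whenever $c_i(\lambda) > 0$; (ii) the step size is an assumed diminishing rule rather than Fisher's halving rule; (iii) the multipliers are moved against $d_j$, which is the descent direction for the upper bound under the sign convention used in this sketch, and are clamped to a positive range to keep the logarithms finite. All names and data are hypothetical.

```python
import math
from itertools import product

def objective(x, failed, passed, prior, pd, pf):
    """J(x, y(x)) from (10), with y_j recovered from constraint (11)."""
    m = len(x)
    val = 0.0
    for i in range(m):
        if x[i]:
            val += math.log(prior[i] / (1 - prior[i]))
            val += sum(math.log((1 - pd[i][k]) / (1 - pf[i][k])) for k in passed)
    for j in failed:
        ln_y = sum(math.log(1 - pd[i][j]) if x[i] else math.log(1 - pf[i][j])
                   for i in range(m))
        val += math.log(1 - math.exp(ln_y))
    return val

def mfd_subgradient(failed, passed, prior, pd, pf, iters=100):
    """Return (x_f, Q_lb, Q_min): best feasible x, its value, best upper bound."""
    m = len(prior)
    ratio = {(i, j): math.log((1 - pd[i][j]) / (1 - pf[i][j]))
             for i in range(m) for j in failed + passed}
    h = {j: sum(math.log(1 - pf[i][j]) for i in range(m)) for j in failed}
    base = [math.log(prior[i] / (1 - prior[i])) + sum(ratio[i, k] for k in passed)
            for i in range(m)]
    lam = {j: 1.0 for j in failed}                      # initialization
    x_f, q_lb, q_min = None, -math.inf, math.inf
    for t in range(1, iters + 1):
        c = [base[i] - sum(lam[j] * ratio[i, j] for j in failed) for i in range(m)]
        x = tuple(1 if ci > 0 else 0 for ci in c)       # Step 1, no constraints (14)
        y = {j: lam[j] / (1 + lam[j]) for j in failed}  # Step 2
        val = objective(x, failed, passed, prior, pd, pf)   # Steps 3 and 4
        if val > q_lb:
            x_f, q_lb = x, val
        q_ub = (sum(math.log(1 - y[j]) + lam[j] * math.log(y[j]) for j in failed)
                + sum(ci for ci in c if ci > 0)
                - sum(lam[j] * h[j] for j in failed))
        q_min = min(q_min, q_ub)
        d = {j: math.log(y[j]) - sum(x[i] * ratio[i, j] for i in range(m)) - h[j]
             for j in failed}                           # Step 5
        norm = sum(v * v for v in d.values())
        if norm == 0:                                   # Step 6
            break
        beta = (q_ub - q_lb) / (norm * t)               # assumed step rule
        for j in failed:                                # assumed descent update
            lam[j] = min(1e6, max(1e-6, lam[j] - beta * d[j]))
    return x_f, q_lb, q_min

# Hypothetical dense example: 3 failure sources, 3 tests.
prior = [0.1, 0.05, 0.2]
pd = [[0.9, 0.1, 0.2], [0.2, 0.8, 0.1], [0.1, 0.3, 0.9]]
pf = [[0.05] * 3 for _ in range(3)]
failed, passed = [0, 2], [1]
x_f, q_lb, q_min = mfd_subgradient(failed, passed, prior, pd, pf)
j_star = max(objective(x, failed, passed, prior, pd, pf)
             for x in product((0, 1), repeat=3))
```

Whatever step rule is used, the bounds bracket the exhaustive optimum `j_star`, as Lemma 4 requires: `q_lb <= j_star <= q_min`.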
B. Computational Issues

The computational complexity of the MFD algorithm for all steps except the first step is $O(m|T_f|)$.
It is well known that the set-covering problem is NP-hard [11], and therefore, the first step of the
multiple fault diagnosis algorithm limits the size of the problem that we can solve.

One of the important points here is that a near-optimal solution as well as an upper bound
solution for the set-covering problem can be found via the Lagrangian relaxation method [1] in a manner
similar to the MFD algorithm. Let $x^n(\lambda)$ and $x^u(\lambda)$ denote the near-optimal (best feasible solution
found) and upper bound solution for the set-covering problem, respectively. Note that any feasible
solution for the set-covering problem is a feasible solution for the multiple fault diagnosis problem.
However, for a given $\lambda$, the best feasible solution for set-covering may not be the best feasible
solution for the multiple fault diagnosis problem. Therefore, we have:
$J(x^n(\lambda), y(x^n(\lambda))) \le J^* \le Q^* \le Q(\lambda, x^*(\lambda), y^*(\lambda)) \le Q(\lambda, x^u(\lambda), y^*(\lambda))$.
Thus, using $x^u(\lambda)$ and $x^n(\lambda)$, we can generate a sequence
of upper and lower bounds to the multiple fault diagnosis problem. In this case, the multiple
fault diagnosis algorithm should be modified as follows: replace the optimal solution $x^*(\lambda)$ in the
algorithm with the near-optimal solution $x^n(\lambda)$, except in $Q_{ub}$ where $x^*(\lambda)$ should be replaced by the
upper bound solution $x^u(\lambda)$. With this modification, the computational complexity of this approach
reduces to $O(m|T_f|)$, and therefore, it can be applied to large-scale systems. Note that, because of the
storage complexity of storing $Pd_{ij}$ and $Pf_{ij}$ for all failure sources and tests, the available memory
of a given computer may limit the largest size of the problem that we can solve.

In large-scale systems, it is practical to assume that the detection and false alarm probabilities
of each test $t_j$ are the same for all failure sources connected to it, i.e., $Pd_{ij} = Pd_j$ and $Pf_{ij} = Pf_j$
if $e_{ij} \in E$; otherwise, $Pd_{ij} = 0$ and $Pf_{ij} = 0$. In this case, we define a binary reachability matrix
$R = \{r_{ij}\}$ such that $r_{ij} = 1$ if $e_{ij} \in E$; otherwise, $r_{ij} = 0$. The detection and false alarm
probabilities of each test $t_j$ for each failure source $s_i$ can be evaluated as follows: $Pd_{ij} = r_{ij} Pd_j$
and $Pf_{ij} = r_{ij} Pf_j$. Note that, in this case, the binary matrix $R = \{r_{ij}\}$ can be stored in a
bit-compacted format, and consequently, the storage complexity of the problem reduces by a factor of
approximately $2K$, where $K$ is the number of bits for representing a floating-point variable in a given
computer. For example, the storage complexity of the MFD problem for a system with 10,000
failure sources and tests when $K = 32$ bits (or equivalently 4 bytes) is 800 Mbytes for storing $Pd_{ij}$
and $Pf_{ij}$, and 12.5 Mbytes for storing the binary matrix $R = \{r_{ij}\}$. However, by storing $Pd_j$, $Pf_j$
and $R = \{r_{ij}\}$, the total memory required reduces to 12.6 Mbytes.
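The storage figures quoted above follow from simple arithmetic, reproduced in the Python sketch below. The function name is hypothetical, and 1 Mbyte is taken to be $10^6$ bytes, which is an assumption that matches the quoted numbers.

```python
def storage_bytes(m, n, bits_per_float=32):
    """Storage needed for the dense and the bit-compacted representations.

    Returns (dense, bit_matrix, compact) in bytes:
      dense      - Pd_ij and Pf_ij stored as two m-by-n floating-point matrices
      bit_matrix - reachability matrix R stored with one bit per entry
      compact    - R plus the per-test vectors Pd_j and Pf_j
    """
    entries = m * n
    dense = 2 * entries * bits_per_float // 8
    bit_matrix = entries // 8
    per_test = 2 * n * bits_per_float // 8
    return dense, bit_matrix, bit_matrix + per_test

dense, bit_matrix, compact = storage_bytes(10_000, 10_000)
reduction_factor = dense // bit_matrix  # equals 2K for K = 32
```

The dense matrices need 800,000,000 bytes, the bit matrix 12,500,000 bytes, and the compact form 12,580,000 bytes, which rounds to the 12.6 Mbytes stated in the text.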

Despite the combinatorial complexity of the multiple fault diagnosis problem,
the optimal solution for this problem can be found via a branch-and-bound algorithm. In the
branch-and-bound algorithm: (1) a binary tree is employed for the representation of the 0-1 combinations, (2)
the feasible region is partitioned into subdomains systematically, and (3) valid upper and lower 
bounds are generated at different levels of the binary tree. The main objective in a general branch- 
and-bound algorithm is to perform an enumeration of the alternatives without examining all 0-1 
combinations of failure sources. Details of branch-and-bound algorithms can be found in any integer 
programming textbooks, e.g., [8, 13, 17, 18]. 

IV. Ranked Set of Most Likely Candidates 

In this section, we consider the problem of determining a ranked set of solutions to the multiple
fault diagnosis problem. That is, the problem is to find the L most likely candidate sets. We present
the following sequential approach to solve this problem:

Initialization: Find the first most likely candidate $X^1$ for the multiple fault diagnosis problem.
Algorithm: DO for $l = 2, \ldots, L$, or until no feasible solution exists,

Eliminate the set of candidates $\{X^1, \ldots, X^{l-1}\}$ from the problem and generate
the $l$-th most likely candidate.

The first part of the algorithm, i.e., initialization, can be solved by the algorithm of the previous section.
In this section, we present an approach to solve the second part of the sequential algorithm. In this
approach, we solve a series of modified copies of the initial multiple fault diagnosis problem.

A. Ranked Algorithm: Modified Copies of MFD Problem 

In this approach, at each iteration, we solve a series of multiple fault diagnosis problems assuming
that the states of some of the failure sources are known prior to diagnosis, i.e., some failure
sources are known good, and some of them are known bad (definitely faulty). A similar approach
has been considered by Murty [12] for determining a ranked set of solutions to assignment problems,
and was recently enhanced by Cox et al. [5] within the context of multi-target tracking. For
simplicity, we represent the multiple fault diagnosis problem by the four-tuple $\Gamma = (MFD, G, B, X)$,
where

1. MFD is the problem in (10)-(14),

2. $G \subseteq S$ represents the set of known good failure sources, i.e., for all $s_i \in G$, $x_i = 0$ (or $s_i \notin X$),

3. $B \subseteq S$ represents the set of definitely faulty failure sources, i.e., for all $s_i \in B$, $x_i = 1$ (or
$s_i \in X$),

4. $X$ is the optimal solution to the MFD problem subject to $G$ and $B$.

Note that the number of unknown failure sources in $\Gamma = (MFD, G, B, X)$ is $m - |G| - |B|$. Initially,
$G$ and $B$ are empty, i.e., $\Gamma^1 = (MFD, \emptyset, \emptyset, X^1)$. Subsequent solutions to $\Gamma^1$ are found by solving
a succession of multiple fault diagnosis problems that are created from $\Gamma^1$ by a process called
partitioning. A problem, $\Gamma$, with the best solution $X$ and size $m - |G| - |B|$, is partitioned into a
set of subproblems, $\Gamma_1, \ldots, \Gamma_{m-|G|-|B|+1}$, such that:

• The union of the sets of possible solutions to $\Gamma_1$ through $\Gamma_{m-|G|-|B|+1}$ is exactly the set of
possible solutions to $\Gamma$,

• The sets of possible solutions to $\Gamma_1$ through $\Gamma_{m-|G|-|B|+1}$ are disjoint, and

• $\Gamma_{m-|G|-|B|+1}$ has only one solution $X$.

Let us assume that $\Gamma_r$ is a dummy subproblem that is used to generate the subproblems $\Gamma_1$
through $\Gamma_{m-|G|-|B|+1}$ from $\Gamma$. The following procedure shows: (1) how to update the subproblem
$\Gamma_r = (MFD, G_r, B_r, X_r)$, and (2) how to make subproblem $\Gamma_l = (MFD, G_l, B_l, X_l)$ from $\Gamma_r$ for
$l = 1, \ldots, m - |G| - |B|$, sequentially, and finally, (3) $\Gamma_{m-|G|-|B|+1} = \Gamma_r$. Initially, $\Gamma_r = \Gamma$. Then, for
$l = 1, \ldots, m - |G| - |B|$, $\Gamma_r$ is partitioned as follows:

• Select any $s_i \in S - (G_r \cup B_r)$,

• If $s_i \in X$, then $G_l \leftarrow G_r \cup \{s_i\}$ and $B_r \leftarrow B_r \cup \{s_i\}$, else $B_l \leftarrow B_r \cup \{s_i\}$ and $G_r \leftarrow G_r \cup \{s_i\}$.
In both cases, the other set of $\Gamma_l$ is inherited unchanged from $\Gamma_r$ as it was before the update.

Note that, at each iteration, the problem $\Gamma_r$ is partitioned into two disjoint subproblems. This is
because we force the subproblems to be different in the status of only one failure source $s_i$ in the
system, i.e., we add $s_i$ to the set of definitely faulty failure sources in one subproblem, and to the
set of known good failure sources in another subproblem. In addition, $X$ cannot be a solution to
$\Gamma_l$ for $l = 1, \ldots, m - |G| - |B|$. Furthermore, $\Gamma_r$ is the only subproblem which contains $X$ and only
$X$ as its solution. This is because $B_r = X$ and $G_r = S - X$.

As an illustration, let us consider a simple system with 3 failure sources $\{s_1, s_2, s_3\}$. In addition,
let us assume that the optimal solution for the MFD problem in this case is $X = \{s_1\}$, i.e., $\Gamma^1 =
(MFD, \emptyset, \emptyset, X^1 = \{s_1\})$. Therefore, the MFD problem can be partitioned into the following
subproblems: $\Gamma_1 = (MFD, G = \{s_1\}, B = \emptyset, X_1)$; $\Gamma_2 = (MFD, G = \emptyset, B = \{s_1, s_2\}, X_2)$; $\Gamma_3 =
(MFD, G = \{s_2\}, B = \{s_1, s_3\}, X_3)$; and $\Gamma_4 = (MFD, G = \{s_2, s_3\}, B = \{s_1\}, X_4)$.
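The partitioning rule can be sketched as a short Python function that reproduces the three-source illustration. The function name, the use of strings as source labels and the choice to visit sources in list order (the text allows any order) are assumptions of this sketch.

```python
def partition(sources, x, good=frozenset(), bad=frozenset()):
    """Split a problem with best solution x into disjoint subproblems.

    Returns (subproblems, last), where each subproblem is a (G, B) pair and
    `last` is the final dummy problem, whose only solution is x itself.
    """
    g_r, b_r = set(good), set(bad)
    subproblems = []
    for s in sources:
        if s in g_r or s in b_r:
            continue  # state of s is already fixed
        if s in x:
            # Subproblem forbids s; the remainder keeps s as definitely faulty.
            subproblems.append((frozenset(g_r | {s}), frozenset(b_r)))
            b_r.add(s)
        else:
            # Subproblem forces s; the remainder keeps s as known good.
            subproblems.append((frozenset(g_r), frozenset(b_r | {s})))
            g_r.add(s)
    return subproblems, (frozenset(g_r), frozenset(b_r))

subs, last = partition(["s1", "s2", "s3"], {"s1"})
```

Here `subs` holds the (G, B) pairs of the first three subproblems of the illustration and `last` holds the pair of the fourth, whose only solution is the original best solution.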

Therefore, we partition $\Gamma^1$ according to its best solution $X^1$, and place the resulting subproblems
together with their best solutions, except the last one, i.e., $\Gamma_{m-|G|-|B|+1}$, on a priority queue of
four-tuples $(MFD, G, B, X)$. We then find a problem in the queue that has the best solution. The
solution of this problem is the second-best solution to the multiple fault diagnosis problem. Now,
we remove this problem from the queue and replace it by its partitioning. The best solution found
in the queue now is the third-best solution to the multiple fault diagnosis problem, and so on. The
pseudocode for the L-ranked algorithm is shown in Figure 4.


Initialization: Find the first solution $X^1$ to the MFD problem, and initialize a priority queue of
four-tuple problems to contain only $\Gamma^1 = (MFD, \emptyset, \emptyset, X^1)$. The top problem on this queue will
always be the problem with the highest likelihood solution.

Step 1: Clear the list of solutions to be returned.

Step 2: DO until priority queue of problems is empty.

Step 2.1: Take the top problem $\Gamma = (MFD, G, B, X)$ off the queue.

Step 2.2: Add $X$ to the list of solutions.

Step 2.3: If the cardinality of the solution set is $L$, Stop.

Step 2.4: Let $\Gamma_r = \Gamma$.

Step 2.5: DO for $l = 1, \ldots, m - |G| - |B|$,

Step 2.5.1: Partition $\Gamma_r$ into $\Gamma_r$ and $\Gamma'$ as follows:

Step 2.5.2: Select any $s_i \in S - (G_r \cup B_r)$,

Step 2.5.3: If $s_i \in X$, then $G' \leftarrow G_r \cup \{s_i\}$ and $B_r \leftarrow B_r \cup \{s_i\}$,
else $B' \leftarrow B_r \cup \{s_i\}$ and $G_r \leftarrow G_r \cup \{s_i\}$.

Step 2.5.4: Find the best solution $X'$ to $\Gamma'$. If $X'$ exists, add $(MFD, G', B', X')$ to
the queue.

END

END

Figure 4: Pseudocode for L-Rank MFD Algorithm
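The queue-driven procedure of Figure 4 can be sketched in Python as follows. To keep the sketch self-contained, an exhaustive solver stands in for the near-optimal MFD algorithm, and the likelihood is supplied as a hypothetical `score` callable; both are assumptions made for illustration, as are the function names and the toy additive score.

```python
import heapq
from itertools import combinations, count

def best_solution(score, sources, good, bad):
    """Exhaustive stand-in solver: best candidate containing `bad`, avoiding `good`."""
    free = [s for s in sources if s not in good and s not in bad]
    best = None
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            cand = frozenset(bad) | frozenset(extra)
            val = score(cand)
            if best is None or val > best[0]:
                best = (val, cand)
    return best

def ranked(score, sources, L):
    """Return up to L (candidate, value) pairs in decreasing order of value."""
    tie = count()  # tie-breaker so the heap never compares sets
    val, x = best_solution(score, sources, frozenset(), frozenset())
    queue = [(-val, next(tie), frozenset(), frozenset(), x)]
    out = []
    while queue:
        neg, _, good, bad, x = heapq.heappop(queue)     # Step 2.1
        out.append((x, -neg))                           # Step 2.2
        if len(out) == L:                               # Step 2.3
            break
        g_r, b_r = set(good), set(bad)                  # Step 2.4
        for s in sources:                               # Step 2.5
            if s in g_r or s in b_r:
                continue
            if s in x:
                g_l, b_l = g_r | {s}, set(b_r)
                b_r.add(s)
            else:
                g_l, b_l = set(g_r), b_r | {s}
                g_r.add(s)
            sub_val, sub_x = best_solution(score, sources, g_l, b_l)
            heapq.heappush(queue, (-sub_val, next(tie),
                                   frozenset(g_l), frozenset(b_l), sub_x))
    return out

# Toy additive score (hypothetical): value of a candidate is the sum of its weights.
weights = {"a": 1.0, "b": -2.0, "c": 0.5}
top3 = ranked(lambda cand: sum(weights[s] for s in cand), ["a", "b", "c"], 3)
```

Because the stand-in solver is exact, the output is correctly ordered. With a near-optimal solver in its place, the ordering is only approximate.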
Since each subproblem is NP-hard, we use the near-optimal MFD algorithm of the previous section
to solve the ranked set problem near-optimally, i.e., $X$ is a near-optimal solution for the problem
$\Gamma = (MFD, G, B, X)$. Thus, it is possible that the $l$-th solution, i.e., $X^l$, has a higher likelihood than
the $k$-th solution, i.e., $X^k$, where $k < l$. Note that we perform one partitioning for each of the
L best solutions and, in the worst case, each partitioning creates $O(m)$ new problems. This creates up to
$O(Lm)$ multiple fault problems and insertions on the priority queue. Each problem takes at most
$O(m|T_f|)$ time to solve near-optimally, and each insertion takes at most $O(\log(Lm))$ time. Therefore,
the worst-case execution time of this approach is $O(Lm(m|T_f| + \log(Lm)))$, or approximately,
$O(Lm^2|T_f|)$.

V. Multiple Fault Diagnosis with Repetitive Tests 

A reasonable and common situation in unreliable testing is to apply a test several times to 
improve the confidence about a given hypothesis (a set of multiple fault candidates). For example, 
in order to reduce the probability of error, i.e., false alarm and missed detection of some faults 
(disorders or diseases), a system (a patient) may be tested multiple times, and because of imperfect 
nature of tests, the test results may be different. In this section, we assume that each test tj has 
been applied rij times in which it passed and failed \i 0 and rjj times, respectively, i.e., rij = fij - f- 
Tjj. Note that applying a test at different times is equivalent to applying independent tests with 
the same structure. In this case, let us assume that Tj and T v denote the set of failed and passed 
tests (without any redundancy), respectively, and Tj (1 T v may not be empty. Thus, the problem 
is: 


max J(x,y) = J?;ln(l - yj) + 

-- her, 

PkM=Y L ) + MPi)} 

t= 1 tt.PT., ik 


subject to (11)-(14). This problem is similar to the problem in (10). Thus, the algorithms in 
previous sections can be readily applied to solve this problem. In this case: (1) in the first step of
the MFD algorithm, $c_i(\lambda)$ is a function of $\alpha_k$ for $k = 1, \ldots, |T_p|$, i.e., the number of times that test $t_k$
passed, and (2) in the second step of the MFD algorithm, the optimum of the objective function
with respect to y is replaced by y*(\j) = i5-+Aj u ( A i) for -? = i, |Ty|. 

VI. Examples 

Example 1: In this example, we consider: (1) a simple diagnostic problem with m = 20 failure
sources (disorders) and n = 20 tests (manifestations), which was used as an example in [3] (Examples 1.a
- 1.d), and (2) a diagnostic problem with m = 15 failure sources and n = 10 tests from [9] (Example
1.e). The false alarm probabilities for these systems are all zero, i.e., $Pf_{ij} = 0$ for $i = 1, \ldots, m$ and
$j = 1, \ldots, n$, and $T_p = T - T_f$. Figures 5 and 6 show the failure source and detection probabilities
for Examples 1.a through 1.d, and Example 1.e, respectively. Figures 7 and 8 show: (1) the
set of failed tests $T_f$, (2) diagnostic results, (3) likelihood, (4) processing time and total number of
runs to converge to the diagnostic results, (5) total processing time and total number of runs, and
(6) approximate duality gap. The diagnostic results are based on the near-optimal multiple fault
diagnosis algorithm in Figure 3. The processing times for these examples are obtained by running
the MFD algorithm on a SPARC 10. Binglin et al. [3] presented a genetic algorithm which
required 10 minutes to find the set of diagnoses in Example 1.a with an IBM PS/2 Model 40 SX 20
MHz microcomputer. These results show the superior performance of our algorithm compared to
the algorithm in [3]. Miller et al. [9] have not reported the processing time for Example 1.e.
However, the largest problem that they considered contained 20 failure sources and 15 tests.

Figure 5: Probabilities for Examples 1.a-1.d

Figure 6: Probabilities for Example 1.e


Ex.    {j | t_j ∈ T_f}              {s_i | s_i ∈ X}       Prob(X | T_f, T_p)
1.a    {1,2,4,5,7,8,13,15}          {1,9,10,14,17}        3.66e-09
1.b    {7,8,9,11,14,15}             {4,5,17,20}           1.32e-10
1.c    {1,3,4,6,7,11,13,15,16}      {1,5,9,14,16,17}      6.82e-13
1.d    {1,2,3,7,8,12,13,17}         {4,5,8,14,19}         2.49e-09
1.e    {1,2,4,5,7,8,9,10}           {3,4,9,12,13}         7.77e-02

Figure 7: MFD Algorithm Results for Examples 1.a-1.e


       Convergence            Total                  Approximate
Ex.    # Runs   Time (sec)    # Runs   Time (sec)    Duality Gap
1.a    8        0.170         58       0.310         4.68%
1.b    2        0.009         65       0.240         4.76%
1.c    2        0.050         68       0.340         4.69%
1.d    1        0.004         64       1.83          4.69%
1.e    2        0.007         58       0.15          4.52%

Figure 8: MFD Algorithm Results for Examples 1.a-1.e

Example 2: In this example, we consider systems in which: (1) m=n=100, m=n=500 and m=n=1000,
(2) the probability of each failure source is set to a random number in (0.001, 0.5), (3) each
test, on average, covers 5, 10 or 20 failure sources, (4) the detection probabilities of a test associated
with its covered failure sources are set to random numbers in (0, 1), (5) the false alarm probabilities
are assumed to be zero, and (6) the number of failed tests is 5, 10 or 20. Figures 9,
10 and 11 show the simulation results for these systems. Each row of these figures represents the
average of simulation results for 5 randomly generated systems. Note that, in most cases, the
average approximate duality gaps are around 5%. However, in some cases, for example the
last row of Figure 9, the approximate duality gap is very large, i.e., 22.15%. In order to improve
the solution (or, equivalently, the approximate duality gap), we can apply the L-ranked algorithm. The
average approximate duality gap based on the 2-ranked algorithm for the last set of systems in Figure
9 reduces to 1.49%.


Average                Convergence            Total                  Approximate
Test        |T_f|      # Runs   Time (sec)    # Runs   Time (sec)    Duality Gap
Coverage
5           5          2        0.10          62       2.70          4.25%
5           10         16       0.38          63       3.11          3.75%
5           20         4        0.83          70       13.78         4.79%
10          5          3        0.25          63       3.35          6.75%
10          10         12       2.91          60       12.00         5.26%
10          20         9        3.01          83       31.00         9.59%
20          5          1        0.11          55       6.97          6.12%
20          10         2        0.85          58       28.81         10.96%
20          20         16       23.39         59       90.72         22.15%

Figure 9: Simulation Results for m=n=100

Example 3: In this example, we consider three systems with 10 failure sources and 10 tests as
in [15]. The false alarm probabilities are assumed to be zero. The simulation results for the $2^{10} = 1024$ possible
combinations of test results are shown in Figure 15. The second column shows the number of correct
Average                Convergence            Total                  Approximate
Test        |T_f|      # Runs   Time (sec)    # Runs   Time (sec)    Duality Gap
Coverage
5           5          1        0.54          62       23.56         4.15%
5           10         3        1.62          72       44.58         3.96%
5           20         16       7.15          66       32.77         4.23%
10          5          1        0.48          58       26.99         4.82%
10          10         1        1.20          69       46.68         4.60%
10          20         15       19.25         67       85.37         6.60%
20          5          1        0.56          51       30.41         3.75%
20          10         6        6.55          64       60.38         3.00%
20          20         11       26.19         64       184.06        16.03%

Figure 10: Simulation Results for m=n=500

cases out of 1024 possible combinations of test results. The third column shows the weighted
probability of correct cases. The columns corresponding to $N_d$ and $N_f$ denote the unweighted
probabilities of detection, i.e., the unweighted probability of common faulty failure sources in the
optimal and near-optimal solutions, and of false alarm, i.e., the unweighted probability of faulty failure
sources that appear in the near-optimal solution but not in the optimal solution. Figure 16 shows the simulation
results based on the 2-ranked algorithm. The average weighted (unweighted) accuracies of the
MFD algorithm and the 2-ranked algorithm are 97.71% (94.99%) and 99.96% (99.77%), respectively.
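As a quick arithmetic check, the quoted averages can be reproduced from the per-example percentages reported in Figures 15 and 16. The short sketch below only averages those published values; the helper name `mean_pct` is hypothetical.

```python
def mean_pct(values):
    """Plain arithmetic mean of a list of percentages."""
    return sum(values) / len(values)


# Per-example values for Examples 3.a, 3.b, 3.c, copied from Figures 15 and 16.
mfd_weighted = [99.91, 98.61, 94.61]
mfd_unweighted = [96.88, 94.82, 93.26]
ranked2_weighted = [100.0, 99.92, 99.95]
ranked2_unweighted = [100.0, 99.51, 99.80]

if __name__ == "__main__":
    for name, vals in [
        ("MFD weighted", mfd_weighted),
        ("MFD unweighted", mfd_unweighted),
        ("2-ranked weighted", ranked2_weighted),
        ("2-ranked unweighted", ranked2_unweighted),
    ]:
        print(f"{name}: {mean_pct(vals):.2f}%")
```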

Example 4: In this example, we consider the medical example in [14, 19]. The system under 
consideration is for neuropsychiatric diagnosis. The system consists of 26 disorders (failure sources) 
from psychiatry and neurology which affect mental status. A list of 56 symptoms (tests) and signs 
was assembled for each disorder. There are 384 links in the system, each of which connects a 
Average                Convergence            Total                  Approximate
Test        |T_f|      # Runs   Time (sec)    # Runs   Time (sec)    Duality Gap
Coverage
5           5          2        3.39          67       102.09        5.89%
5           10         2        3.53          73       122.88        5.12%
5           20         2        5.02          67       138.56        4.59%
10          5          1        2.27          54       87.95         4.09%
10          10         2        3.83          55       99.53         4.27%
10          20         5        11.90         66       169.60        4.99%
20          5          1        3.26          53       103.29        4.28%
20          10         11       30.29         57       137.15        4.94%
20          20         28       139.14        76       374.54        9.42%

Figure 11: Simulation Results for m=n=1000

disorder to a manifestation. Similar to [14], five groups of test cases are used to test the MFD
and L-ranked algorithms. Manifestations are chosen randomly from the total set of 56 possible
manifestations based on a uniform distribution. Each group of test cases consists of ten different
sets of manifestations. Each case in the first test group has one present manifestation (failed test);
the cases in the other groups have 3, 5, 7 and 9 manifestations, respectively. If any randomly generated test
result is inconsistent with the causal network, the case is discarded and a new one is generated.
Inconsistent test results may occur because the causal network used in the experiment has
some perfect tests, i.e., $Pd_{ij} = 1$ and $Pf_{ij} = 0$. Thus, after applying the first Lemma and reducing
the size of the problem, the second Lemma may not be satisfied, i.e., there exists a failed test that
is not covered by any failure source. Simulation results show that, among all 50 cases, the MFD and
2-ranked algorithms generate optimal solutions in 98% and 100% of the cases, respectively. Peng and Reggia applied their
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p(aj)= 0.026 

p(ae)s= 0.023 

0.014 
p(sj)ss 0.048 

p(4 3 )= 0.054 
p(s R )= 0.079 

p( s 4 ) = 0.060 
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Figure 12: Failure Source and Detection Probabilities for Example 3. a 


,<4!)= 0.170 
p(46)= 0.180 

P( 5 2 ) = 0.070 
p(j 7 )= 0.075 

p(i 3 )= 0.030 
p(s 8 )= 0.030 

p(j 4 )= 0.120 

PC*9)= °* i4 ° 

p(4 S )= 0.135 
p(sj 0 )= 0.050 

Pd 1>2 = 0.06 

Pdi 4 = 0.68 

Pd 16 = 0.10 

Pdy ^ 7 a: 0.51 

Pd 2jl as 0.53 

Pd 2> 3 = 0.81 

Pd 2>4 = 0.09 

pd 2,S~ °* 85 

Pd 2,S~ 0.13 

pd 2,9 ~ 0.34 

Pd 2,10 = 0.85 

pd 3,2 = 054 

pd z,s— °- 45 

Pd 3 _ 6 = 0.90 

Pd 3>7 = 0.59 

p< *3,10 = O ' 29 

Pd 4i2 “ 0.74 

Pd 4j3 =: 0.52 

Pd 47 = 0.65 

Pd 4f9 = 0.32 

Pd 5i3 = 0.72 

Pd Sj g= 0.49 

Pdg i3 = 0.09 

Pd 6>5 = 0.66 

p< * 6 , 10 = 044 

Pd 7j 3 = 0.22 

Pd 7j4 = 0.46 

Pd 7j5 =s 0.21 

Pd 7t6 = 0.76 

pd 7,10 = 0.43 

Pd 81 as 0.29 

Pd S,2 = 034 

Pd S ,8 — 0.25 

Pd 9)I = 0.39 

Pd 94 sa 0.20 

Pd 9 j,= 0.90 

Pd 9,6 = °- 48 

Pdg 7 = 0.38 

P< 1 10 , 2 = 074 

P< 1 10,8 = °- 27 


Figure 13: Failure Source and Detection Probabilities for Example 3.b 


)= 0.34 
p(4 6 )= 0.36 

p(a 2 )= 0.14 
p(a 7 )= 0.30 

p{4 3 )= 0.06 

Pt> 8 )= 0-06 

p(‘i)= 0.24 
P<4 9 )= 0.2* 

p(4 5 )= 0.27 
p(410)= 0-10 

Pd lj2 = 0.06 

Pdi f4 = 0.68 

Pd 16 = 0.10 

Pd lj7 s= 0.51 

Pd 2,l~ 0.53 

Pd 2j3 = 0.81 

Pd 2|4 = 0.09 

Pd 2> 5= 0.85 

Pd 2>8 = 0.13 

Pti 2,9 = °- 34 

pd 2,10= °.85 

Pd 3|2 = 0.54 

Pd 3i5 = 0.45 

Pd 3>6 = 0.90 

pd 3,7 = °- 59 

p< *3,10= °* 29 

Pd 4|2 = 0.74 

Pd 4 ^ 3 = 0.52 

Pd 4|7 = 0.65 

Pd 4j9 = 0.32 

Pd Sj3 = 0.72 

Pd 5i g= 0.49 

Pdg i3 = 0.09 

Pd 6,5 = 0.66 

p< *6,10 = °* 44 

Pd 7t3 = 0.22 

Pd 7>4 = 0.46 

Pd 7|5 = 0.21 

Pd 7> 6 ss 0.76 

p< *7,10 = °* 43 

Pd 8|1 = 0.29 

Pdg i2 = 0.34 

Pdg i8 = 0.25 

Pd 9|1 ^ 0.39 

Pd 9t 4 = 0.20 

Pd 9i5 = 0.90 

p <*9,6 = 0,48 

Pd 9,7 = 0.38 

P< *10,2 = 0,74 

Pd 10 ,g= 0.27 


Figure 14: Failure Source and Detection Probabilities for Example 3.c 
Example    Correct cases                          N_d        N_f
           # Times (out of 1024)    Weighted
3.a        992 (96.88%)             99.91%        98.63%     0.39%
3.b        971 (94.82%)             98.61%        97.60%     0.31%
3.c        955 (93.26%)             94.61%        97.66%     0.37%

Figure 15: MFD Alg. Results for Examples 3.a-3.c


Example    Correct cases                          N_d        N_f
           # Times (out of 1024)    Weighted
3.a        1024 (100%)              100%          100%       0.00%
3.b        1019 (99.51%)            99.92%        99.77%     0.06%
3.c        1022 (99.80%)            99.95%        99.92%     0.03%

Figure 16: 2-ranked Alg. Results for Examples 3.a-3.c
competition-based connectionist methods to this causal network. Their algorithm generated 74%
of globally optimal solutions, and 90% of one of the three globally optimal solutions.

VII. Conclusion 

In this paper, we considered the problem of constructing optimal and near-optimal multiple
fault diagnoses in bipartite systems with unreliable (imperfect) tests. We presented a multiple fault
diagnosis algorithm based on Lagrangian relaxation and the subgradient optimization method, which
provides near-optimal solutions for multiple fault diagnosis, and upper bounds for an optimal
branch-and-bound algorithm. Computational results indicate that our algorithm can be used in
systems with as many as 1000 faults. In addition, we presented an algorithm to generate the set
of L-ranked multiple fault candidates. In this algorithm, we find the most likely candidate using
the near-optimal multiple fault diagnosis algorithm. Then, we partition the problem, based on
the first solution, into a set of disjoint subproblems. The solution with the highest likelihood
among these subproblems represents the second most likely candidate. This procedure is continued until
L ranked multiple fault diagnoses are found, or no more feasible solutions exist. We showed that
the computational complexity of this approach is $O(Lm^2|T_f|)$ and that it is, therefore, applicable to systems
with as many as 1000 faults and tests. Finally, we extended the multiple fault diagnosis problem to
redundant or repetitive tests. In this case, the problem is very similar to the original multiple fault
diagnosis problem, and therefore, the MFD algorithm can be extended to this problem as well.

In this paper, we assumed that the test results are known prior to diagnosis. That is, we
considered the problem of multiple fault diagnosis with unreliable tests. The problem of sequential
multiple fault diagnosis strategy (testing) with unreliable tests is an important problem in field
maintenance. Furthermore, the order of partitioning in the L-ranked algorithm may improve the
accuracy of the near-optimal solutions. We expect to investigate these challenging issues in our
future efforts.
Abstract

In this paper, we consider imperfect test sequencing problems under single fault assumption. This 
is a partially observed Markov decision problem (POMDP), a sequential multi-stage decision problem 
wherein the states are the set of possible failure sources and information regarding the states is obtained 
via the results of imperfect tests. The optimal solution for this problem can be obtained by applying a 
continuous state Dynamic Programming (DP) recursion. However, the DP recursion is computationally 
very expensive owing to the continuous nature of the state vector comprising the probabilities of faults. 
In order to alleviate this computational explosion, we present an efficient implementation of the DP 
recursion. We also consider various problems with special structure (parallel systems) and derive closed 
form solutions/index-rules without having to resort to DP. Finally, we consider various top-down graph 
search algorithms for problems with no special structure, including multi-step DP, multi-step information 
heuristics and certainty equivalence algorithms. We compare these near-optimal algorithms with DP for 
small problems to gauge their effectiveness. 


1 Introduction 

An important issue in the field maintenance of systems is the imperfect nature of tests due to improper 
setup, operator error, electromagnetic interference, environmental conditions, or aliasing inherent in the 
signature analysis of built-in-self-tests. Typically, a user complaint, which is a subjective measure of system 
performance, can also be considered as an imperfect test because it does provide some insight into the 
malfunction. Imperfect testing introduces an additional element of uncertainty into the diagnostic process: 
the pass outcome of a test does not guarantee the integrity of components under test (because the test may 
have missed a fault), or a failed test outcome does not mean that one or more of the implicated components 
are faulty (because the test outcome may have been a false alarm). 

The consequences of a test error depend on the disposition of the system after repair. If a test results in a 
false alarm, a functioning component is replaced, and a failed component may be left in place. If the system 
is then returned to service, the system fails immediately. In the case of missed detection by a test, the overall 
test could indicate that no item has failed. In this case, the system might be returned to service where it fails 
immediately or it might be scrapped. Either choice implies a cost. Relatively little attention has been given 
to imperfect testing. Most research efforts were directed at finding test strategies for systems with special 
structure (parallel systems). The most complete treatment for parallel systems with imperfect tests is by 
Firstman and Gluss [1] in which a two level testing is studied with both false alarms and missed detections 
in tests. However, it is assumed that test errors are ultimately recovered by repeating the tests until a 
proper repair is made. The test sequence is then determined in the same manner as for perfect testing. The
perfect-test recheck assures test termination with proper repair and thus fails to capture the fact that test
errors are often unrecoverable. For many systems, imperfect test results cannot be recognized either because
of the test design or because retesting is economically infeasible. In these cases, the consequences of test
errors occur outside of the repair facility. Nachlas and Loney [2] presented the problem of test sequencing
for fault diagnosis using unreliable tests for parallel systems. The objective of the fault diagnosis problem is
to minimize the expected cost required to diagnose and repair the failed component. They present heuristic
algorithms based on efficient enumeration of permutations of test sequences, which are not suitable for large
problems with arbitrary structures. These problems belong to a class of hypothesis testing problems with
dynamic information seeking. Problems in dynamic search arise in a wide variety of applications [10] [11]
[12] [13] [14]. Dynamic search in the context of sequential detection was extensively treated by Wald [18]. In
[15], [16], [17], different search problems in the presence of false alarms were considered.

In this paper, we consider a generalized formulation of the test sequencing problem in the presence of 
imperfect tests for systems of arbitrary structure. The test sequencing problem in this case is a partially 
observed Markov decision problem (POMDP) [8] [9], a sequential multi-stage decision problem wherein the 
states are the set of possible failure sources and information regarding the states is obtained via the results 
of imperfect tests. The optimal solution for this problem can be obtained by applying a continuous state 
Dynamic Programming (DP) recursion. However, the DP recursion is computationally very expensive owing 
to the continuous nature of the state vector comprising the probabilities of faults. In order to alleviate 
this computational explosion, we present an efficient implementation of the DP recursion. We also con- 
sider various problems with special structure (parallel systems) and derive closed form solutions/index-rules 
without having to resort to DP. Finally, we consider various top-down graph search algorithms for problems 
with no special structure, including multi-step DP, multi-step information heuristics and certainty equiv- 
alence algorithms. We compare these near-optimal algorithms with DP for small problems to gauge their 
effectiveness. 

2 Optimal Test Sequencing with Imperfect Tests 

In its simplest form, the test sequencing problem with imperfect tests is as follows:¹

1. A system with a finite set of failure sources $S = \{s_0, s_1, s_2, \ldots, s_m\}$ is given. We make the standard
assumption that the system is tested frequently enough that only one or none of the faults has occurred.
The "no-fault" condition is denoted by a dummy failure source $s_0$;

2. The a priori probability of each failure source, $p(s_i)$, is known;

3. A finite set of $n$ available tests $T = \{t_1, t_2, \ldots, t_n\}$ is given, where each test $t_j$ checks a subset of
failure sources. The relationship between the set of failure sources and the set of tests is represented
by a reachability matrix $R = [r_{ij}]$, where $r_{ij} = 1$ if test $t_j$ monitors failure source $s_i$;

4. The reliability of each test $t_j$ is characterized by the detection-false-alarm probability pair $(P_{dj}, P_{fj})$,
where $P_{dj}$ = Prob{test $t_j$ fails | any of the failure sources monitored by $t_j$ has failed}, and $P_{fj}$ =
Prob{test $t_j$ fails | none of the failure sources monitored by $t_j$ has failed};

5. Each test $t_j$ ($1 \le j \le n$) costs an amount $c_j$ measured in terms of time, or other economic factors;

6. Each failure source $s_i$ ($1 \le i \le m$), once identified, has repair/replacement cost $f_i$, false repair/replacement
cost $C_{Ri}$, and missed repair/replacement cost $C_{Mi}$ associated with it.

The problem is to design a test algorithm with minimum expected diagnostic cost to isolate the failure
source, if any, with a specified level of confidence $\alpha$ (typically, $\alpha \in [0.95, 0.99]$). Employing the single fault
assumption, the reachability matrix $R$ and the test reliabilities $(P_{dj}, P_{fj})$ can be combined into a single
matrix of "likelihoods", $D = [d_{ij}]$, where $d_{ij}$ is given by

$$d_{ij} = r_{ij} P_{dj} + (1 - r_{ij}) P_{fj}, \tag{1}$$

where $d_{ij}$ = Prob{test $t_j$ fails | failure source $s_i$ has occurred}.
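As an illustration of equation (1), the following minimal sketch builds the likelihood matrix from a reachability matrix and per-test reliabilities. The function name `likelihood_matrix` and the numerical values are assumptions chosen for the example, not data from the paper. Row 0 plays the role of the dummy no-fault source $s_0$, which no test monitors.

```python
def likelihood_matrix(R, Pd, Pf):
    """d[i][j] = r[i][j] * Pd[j] + (1 - r[i][j]) * Pf[j], as in equation (1)."""
    return [
        [r * Pd[j] + (1 - r) * Pf[j] for j, r in enumerate(row)]
        for row in R
    ]


if __name__ == "__main__":
    # Rows: s_0 (no fault), s_1, s_2. Columns: tests t_1, t_2.
    R = [
        [0, 0],  # no test monitors the dummy no-fault source
        [1, 0],  # s_1 is monitored by t_1 only
        [1, 1],  # s_2 is monitored by both tests
    ]
    Pd = [0.9, 0.8]   # hypothetical detection probabilities
    Pf = [0.05, 0.1]  # hypothetical false alarm probabilities
    for row in likelihood_matrix(R, Pd, Pf):
        print(row)
```

With perfect tests ($P_{dj} = 1$, $P_{fj} = 0$) the same function returns $R$ itself.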

When tests are perfect, that is, $P_{dj} = 1 - P_{fj} = 1$ for all tests, we have $d_{ij} = r_{ij}$. This corresponds to
a perfectly observed Markov decision problem, and has been discussed extensively in [7]. The solution to
this problem is a diagnostic decision tree, wherein the root corresponds to the state of complete ignorance,
the intermediate nodes relate to states of residual ambiguity and the leaves correspond to individual failure
sources. The test algorithm terminates when the failed element is isolated with complete certainty (that is,
$\alpha = 1$).

¹ Extension to the case when $(P_{dj}, P_{fj})$ are functions of failure source $s_i$ is straightforward.

When the tests are imperfect, the test sequencing problem is a partially observed Markov decision problem
(POMDP), a sequential multi-stage decision problem wherein the states are the set of possible failure sources
and information regarding the states is obtained via the results of imperfect tests. It can be shown [4]
that the probabilities of failure sources conditioned on all the previous test results constitute a sufficient
statistic (i.e., contain all the necessary information) for deciding the next test to be applied. Formally, let
$j(k) \in \{1, 2, \ldots, n\}$ be the test applied at stage $k$ and let $O(k) \in \{1 (= \text{pass}), 0 (= \text{fail})\}$ be the outcome of
test $t_{j(k)}$. Further, let $I_{k-1}$ be the information available to decide on test $t_{j(k)}$ to be applied at stage $k$. This
information includes all the past tests applied and their outcomes given by:

$$I_{k-1} = \{(t_{j(l)}, O(l))\}_{l=1}^{k-1}. \tag{2}$$

Using Bayes' rule, the conditional probabilities of hypotheses $\{\pi_i(k) = p(s_i \mid I_k),\ i = 0, 1, \ldots, m\}$, which
are the information states of the decision process at each stage $k$, can be shown to evolve as

$$\pi_i(k+1) = \frac{\left[O(k) + (-1)^{O(k)} d_{ij(k)}\right]\pi_i(k)}{\sum_{l=0}^{m}\left[O(k) + (-1)^{O(k)} d_{lj(k)}\right]\pi_l(k)}. \tag{3}$$

The above recursion is initiated with $\pi_i(0) = p(s_i)$, $i = 0, 1, \ldots, m$, the a priori probability distribution of
failure sources. The optimal test $t_{j(k)}$ is given by the dynamic programming (DP) recursion [4]:

$$h^*(\{\pi_i(k)\}) = \min_{j(k) \in \{1,2,\ldots,n\}} \left\{ c_{j(k)} + \left(\sum_{i=0}^{m} d_{ij(k)}\pi_i(k)\right) h^*\!\left(\left\{\frac{d_{ij(k)}\pi_i(k)}{\sum_{l=0}^{m} d_{lj(k)}\pi_l(k)}\right\}\right) + \left(\sum_{i=0}^{m} (1 - d_{ij(k)})\pi_i(k)\right) h^*\!\left(\left\{\frac{(1 - d_{ij(k)})\pi_i(k)}{\sum_{l=0}^{m} (1 - d_{lj(k)})\pi_l(k)}\right\}\right) \right\} \tag{4}$$


where $c_{j(k)}$ is the cost of test $t_{j(k)}$, $h^*(\{\pi_i(k)\})$ is the optimal expected cost-to-go from the information state
$\{\pi_i(k) : i = 1, 2, \ldots, m\}$, and the terms involving $h^*$ inside the brackets are the optimal costs-to-go from the
information states corresponding to the fail and pass outcomes, respectively. The terminal states of this
recursion have known cost:

$$h^*(\{\pi_i\}) = f_{i'} + (1 - \pi_{i'})\,C_{Ri'} + \sum_{i = 1,\ i \ne i'}^{m} C_{Mi}\,\pi_i \tag{5}$$

where

$$i' = \arg\max_i \pi_i \tag{6}$$

This definition of the terminal cost function corresponds to the policy of repairing the most likely fault. Since
the $\{\pi_i\}$ are continuous, the above DP recursion is continuous. Thus, the consideration of imperfect tests in
the test sequencing problem formulation converts a finite (albeit large) dimensional search problem of the
perfect test case into an infinite dimensional stochastic control problem.
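The belief update and the cost-to-go recursion can be sketched as follows. This is a minimal illustration under stated assumptions, not the efficient DP implementation discussed in the paper: the recursion is truncated at a fixed `depth`, it stops early once the largest posterior reaches the confidence level `alpha`, and the terminal cost is passed in as a user-supplied function. The names `bayes_update` and `cost_to_go` are hypothetical. Outcomes follow the convention of this section (1 = pass, 0 = fail).

```python
def bayes_update(pi, d_col, outcome):
    """Posterior over failure sources after one test outcome (1 = pass, 0 = fail).

    d_col[i] is the probability that the applied test fails given source i.
    """
    weights = [(outcome + (-1) ** outcome * d) * p for d, p in zip(d_col, pi)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("outcome has zero probability under the current state")
    return [w / total for w in weights]


def cost_to_go(pi, D, costs, terminal_cost, alpha, depth):
    """Depth-limited expected cost-to-go from information state pi."""
    if depth == 0 or max(pi) >= alpha:
        return terminal_cost(pi)
    best = float("inf")
    for j, c in enumerate(costs):
        d_col = [row[j] for row in D]
        p_fail = sum(d * p for d, p in zip(d_col, pi))
        expected = c
        for outcome, prob in ((0, p_fail), (1, 1.0 - p_fail)):
            if prob > 0.0:
                nxt = bayes_update(pi, d_col, outcome)
                expected += prob * cost_to_go(
                    nxt, D, costs, terminal_cost, alpha, depth - 1
                )
        best = min(best, expected)
    return best


if __name__ == "__main__":
    prior = [0.5, 0.5]
    D = [[0.9], [0.1]]  # one imperfect test, hypothetical likelihoods
    penalty = lambda pi: 10.0 * (1.0 - max(pi))  # illustrative terminal cost
    print(bayes_update(prior, [0.9, 0.1], 0))
    print(cost_to_go(prior, D, [1.0], penalty, alpha=0.99, depth=2))
```

Because the information state is continuous, the number of distinct states visited grows with every test outcome, which is the computational burden the paper refers to.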

3 Systems of Parallel Structure 

Parallel systems are characterized by a reachability matrix R with ones on the diagonal and zeros everywhere
else, for some permutation of tests. That is, every failure state is detected by one, and only one, test.
For parallel systems, we can explicitly characterize the optimal policy in the perfect test case: at each state
of ambiguity, test a module with the highest ratio of probability of failure to the cost of testing the module.
For the imperfect testing case, such a closed form solution cannot be obtained without making additional
assumptions. However, in the following subsections, we derive closed-form solutions for some special cases.
3.1 Special Case 1:

Let us specialize the above to a parallel system with the following assumptions:

• all test costs are equal

• a test can be applied more than once

• a fault is implicated if its posterior probability at any stage of testing exceeds a given threshold $\gamma$
(typically, $\gamma \in [0.95, 0.99]$)

Given this problem context, we first consider a greedy one-step lookahead strategy that maximizes the
posterior probability of correct decision assuming that a decision would be taken at the next stage implicating
the failure with the maximum posterior probability (MAP decision rule).

Let $\alpha$ and $\beta$ denote the false alarm and missed detection probabilities of tests. Let $\pi_i(k)$ denote the
conditional probability of the failure source $s_i$ at the $k$-th stage of testing (stage 0 is when no tests have been
applied). Let $f_{ij}(O_j \mid s_i)$ denote the conditional probability density of the outcome $O_j \in \{0, 1\}$ of test $t_j$ given
that $s_i$ is present. Given the nature of the tests, we further know that,

$$f_{ij}(O_j \mid s_i) = \begin{cases} g(O_j) = \beta\,\delta(O_j) + (1 - \beta)\,\delta(O_j - 1) & \text{for } i = j \\ h(O_j) = (1 - \alpha)\,\delta(O_j) + \alpha\,\delta(O_j - 1) & \text{for } i \ne j \end{cases} \tag{7}$$

where $\delta(\cdot)$ is the Dirac delta function.²
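The greedy criterion can be evaluated numerically for a small parallel system. The sketch below is an illustration under stated assumptions: it reads equation (7) as saying that outcome 1 is a failed test and outcome 0 a passed test, and it computes the probability of a correct MAP decision after one more test as the sum, over both outcomes, of the largest joint probability. The function name `prob_correct` and the numbers are hypothetical.

```python
def prob_correct(pi, j, alpha, beta):
    """Probability of a correct MAP decision after applying test j once.

    pi    : current posteriors over failure sources (parallel system)
    alpha : false alarm probability, beta : missed detection probability
    """
    total = 0.0
    for outcome in (0, 1):  # 1 = test fails, 0 = test passes (assumed reading of (7))
        joint = []
        for i, p in enumerate(pi):
            if i == j:
                lik = (1.0 - beta) if outcome == 1 else beta
            else:
                lik = alpha if outcome == 1 else (1.0 - alpha)
            joint.append(p * lik)
        # The MAP rule picks the source with the largest joint probability.
        total += max(joint)
    return total


if __name__ == "__main__":
    pi = [0.5, 0.3, 0.2]
    for j in range(len(pi)):
        print(j, round(prob_correct(pi, j, alpha=0.1, beta=0.2), 4))
```

For these illustrative numbers the best test is the one for the source with the second highest posterior, and the test for the least likely source is the worst choice, which is in line with the analysis that follows.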

Let us assume that test $t_{j_k}$ is the next test to be applied at stage $k$. Then, since the greedy approach
corresponds to the assumption that the next test is the final one, the decision rule at the next stage is to
implicate the failure source $s_d$ such that:

$$\pi_d(k+1) = \max_i \pi_i(k+1) \tag{8}$$

which translates to,

$$\pi_d(k)\, f_{d j_k}(O(j_k, k) \mid s_d) = \max_i \{\pi_i(k)\, f_{i j_k}(O(j_k, k) \mid s_i)\} \tag{9}$$

Let us define:

$$P(C \mid s_i, j_k) = \text{Prob}(\text{Correct Decision} \mid s_i, j_k) \tag{10}$$

$$P(C \mid s_i, j_k) = \Pr\Bigl(\pi_i(k)\, g(O(j_k, k)) > \max_{l \ne i}\{\pi_l(k)\, h(O(j_k, k))\}\Bigr) \quad \text{for } j_k = i \tag{11}$$

We can simplify the above equation as,

$$P(C \mid s_i, j_k) = \Pr\Bigl\{\pi_i(k)\, h(O(j_k, k)) > \max\Bigl[\max_{l \ne i,\ l \ne j_k}\{\pi_l(k)\, h(O(j_k, k))\},\ \pi_{j_k}(k)\, g(O(j_k, k))\Bigr]\Bigr\} \quad \text{for } j_k \ne i \tag{12}$$

In order to simplify the above two equations, we define:

$$m = \arg\max_i \pi_i(k) \tag{13}$$

$$\bar m = \arg\max_{i \ne m} \pi_i(k) \tag{14}$$

² The function $\delta(x)$ is defined via: $\delta(x) = 0$ for all $x \ne 0$, and $\int_{-\infty}^{\infty} \delta(x)\,dx = 1$.
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Then we have for i = m: 


and for i ^ m, 


where 


for jk = m 

*m(b) 

= 1 - WM ) for 3 k 1 m 
(A J 


p(c\ Si j k ) = QA^rnl) ^ h = i 

5T m(*j 

= 0 for jk ± i 


$$Q_g(a) = \Pr(g(x) > a\,h(x) \mid \text{PDF of } x \text{ is } g(x))$$
$$Q_h(a) = \Pr(g(x) > a\,h(x) \mid \text{PDF of } x \text{ is } h(x))$$


( 15 ) 


( 16 ) 


( 17 ) 

( 18 ) 


Therefore, we have: 


P(C\j k ) = M*K?,(z^§) + »*(*)( i - Q4?48)) for j k = m 




^rn(^) 

^m(^) 

^rn(^) 


^ *m{k) ' 

j + 7Tm(fc)(l - Qfc(^S)) for jk = m 


*jk\ K ) 


(^-) 

Wm(k) 

*j k { k ) 


)) for all other jk 


Now an index $j_k$ is to be chosen so that the above expression is maximized. In the following, we will show
that $P(C \mid j_k)$ for $j_k$ other than $m$ and $\bar m$ is less than $P(C \mid \bar m)$. Let us transform $Q_g(\cdot)$ and $Q_h(\cdot)$ by forming
the likelihood ratio:

$$\Lambda = \frac{g(x)}{h(x)} \tag{19}$$

Assume that if $x$ has a PDF $h(x)$, then $\Lambda$ has a probability density function $f_h(\Lambda)$, distribution function
$F_h(\Lambda)$, and integral of the distribution function $\Psi_h(\Lambda)$. That is,

$$\frac{d}{d\Lambda}\Psi_h(\Lambda) = F_h(\Lambda), \qquad \frac{d}{d\Lambda}F_h(\Lambda) = f_h(\Lambda) \tag{20}$$

Similarly, assume that if $x$ has a PDF of $g(x)$, then $\Lambda$ has a probability density $f_g(\Lambda)$, distribution function
$F_g(\Lambda)$, and integral of the distribution function $\Psi_g(\Lambda)$. It follows that:

$$Q_h(a) = 1 - F_h(a), \quad \text{and} \quad Q_g(a) = 1 - F_g(a) \tag{21}$$

It can be easily shown that:

$$\frac{f_g(\Lambda)}{f_h(\Lambda)} = \Lambda \tag{22}$$

and that

$$Q_g(a) = 1 - \int_{-\infty}^{a} \Lambda f_h(\Lambda)\,d\Lambda = 1 - a F_h(a) + \Psi_h(a) \tag{23}$$

Now, let us define

$$\varphi(y) = 1 - Q_h\!\left(\frac{1}{y}\right) + y\,Q_g\!\left(\frac{1}{y}\right) \tag{24}$$

and note that,

$$P(C \mid j_k) = \pi_m(k)\,\varphi\!\left(\frac{\pi_{j_k}(k)}{\pi_m(k)}\right) \quad \text{for } j_k \ne m \text{ and } j_k \ne \bar m \tag{25}$$
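We have,

$$\varphi(y) = y + y\,\Psi_h\!\left(\frac{1}{y}\right) \tag{26}$$

and $\varphi(0) = 1$. Now,

$$\frac{d\varphi(y)}{dy} = 1 + \Psi_h\!\left(\frac{1}{y}\right) - \frac{1}{y}F_h\!\left(\frac{1}{y}\right) = Q_g\!\left(\frac{1}{y}\right) \ge 0 \tag{27}$$

Hence, we know that $\varphi(y)$ is a non-decreasing function, and that its minimum value is at $y = 0$, which is
unity.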

Therefore, 

P(C\jk) = for Jk # m and j k ± m (28) 

Since $\varphi(y) > Q_g(x)$ for $0 < x, y < 1$, the conditional probability $P(C \mid j_k)$ is maximized by choosing $j_k = \bar m$.
However, $P(C \mid m)$ can be greater than $P(C \mid \bar m)$, and we have to check this condition. Now suppose we define
the binary Bayesian equal-cost hypothesis test, where the hypotheses are that the measurement $x$ has come
either from a probability distribution $h(\cdot)$ or from $g(\cdot)$, with $\pi_h$ and $\pi_g$ being the associated priors. We can
write the corresponding minimal probability of error as,


>•’ (*■/>) = *hQh( tt~t) + (! - 7r *)( i “ Q§( >r~ J ) 


Note that 


and 


P(C\m) 


(1-**)' v /v 
1 “ " { ™ w }] + 


p (CN = [:-,«{) - 5^^}] (nK*) + ».(*)) 

Hence, the optimal strategy is to set j k = m if 


(29) 

(30) 

(31) 


f *m( k ) } -fj *m(k) 1 

\ Krn(k) + TTrh(k) J \ TT W (k) + TTfh(k) J 


Otherwise, set j k = m. 

The above decision rule can be further simplified by substituting the exact expressions for Q g ( .) and 
Qh{- )- That is, the optimal strategy is to set j k — rh if 


%(1) [q,(— 

Km 


(t) ) + «*(r#) -i 


Xm(k) 


^rnik)' 


+ ?Tm(k) 


i n / x m(k)^ ^ 7Trf,(fc) ] 

1 - QaK—nr) - QhK—prrV 


' ~m(k) ‘ 


^m(k) \ 


> 0 


By combining the terms, and expanding Q g (.) and Q /,(.), the above comparison can be written as, 

1 - {(1 - a + m^r- > -tit) + (! + «- > 0 


*>«(*)' 


1 - a nm(k) ' 


(32) 


(33) 


where the indicator function $1(E) = 1$ when the logical expression $E$ is true, and zero otherwise. This
implies that, when both missed detections and false alarms are present, the optimal policy is to test the
fault with the highest posterior probability if the above expression is not greater than zero. Otherwise, the fault
with the second highest posterior probability should be tested. Note that, when the probability of missed
detection $\beta$ is zero, the above expression is always greater than zero, implying that the decision rule is
to choose $j(k) = \bar m$, i.e., test the fault with the second highest posterior probability.


3.2 Special Case 2: 

Let us consider another special case involving a parallel system with the following assumptions: 
• the test costs $\{c_1, c_2, \ldots, c_m\}$ are known

• a test cannot be applied more than once

• a fault is implicated when a test detecting it fails

• the tests have no false alarms

• the missed detection probabilities $\{\beta_j\}$ are known

Given this scenario, we now proceed to show that the optimal test sequence is an index rule. 

Let S* = represent the index set of the optimal test sequence that minimizes the 

expected testing cost. Before writing down the expression for the expected testing cost, let us consider the 
testing strategy described above in detail. The first test to be applied is and this test is always applied. 
The second test tj( 2 ) is applied under one of the following two situations: 

1. the component Sj(i) is not faulty (hence < ; (i) would not fail), or 

2. the component Sj(i) is faulty but the test lj^ missed detecting it (the probability of this event is 0j(i)) 

Similarly, the third test tj(z) in the sequence is applied if sj(i) and s ; ( 2 ) are not faulty or if the tests ty ( ^ 
and tj( 2 ) missed detecting them. 

Now, the above discussion lets us write the expression for the expected testing cost as,

$$E[J(S^*)] = c_{j(1)} + c_{j(2)}\bigl(1 - p(s_{j(1)}) + p(s_{j(1)})\beta_{j(1)}\bigr) + \cdots = \sum_{k=1}^{m} c_{j(k)}\Bigl(1 - \sum_{i=1}^{k-1} p(s_{j(i)})(1 - \beta_{j(i)})\Bigr) \qquad (34)$$

Let $S' = \{j'(1), j'(2), \ldots, j'(m)\}$ be another sequence of tests obtained from $S^*$ by interchanging terms
k and k + 1. That is,

$$j'(i) = j(i) \ \text{for } i \ne k \text{ and } i \ne k+1, \qquad j'(k) = j(k+1), \qquad j'(k+1) = j(k)$$

If $S^*$ is the optimal sequence, then for any k, the expected testing cost of $S'$ should be greater than or equal
to that of $S^*$. Hence, by expanding and simplifying the logical expression $E[J(S')] \ge E[J(S^*)]$, we get

$$c_{j(k+1)}\Bigl(1 - \sum_{i=1}^{k-1} p(s_{j(i)})(1 - \beta_{j(i)})\Bigr) + c_{j(k)}\Bigl(1 - p(s_{j(k+1)})(1 - \beta_{j(k+1)}) - \sum_{i=1}^{k-1} p(s_{j(i)})(1 - \beta_{j(i)})\Bigr)$$

$$\ge\ c_{j(k)}\Bigl(1 - \sum_{i=1}^{k-1} p(s_{j(i)})(1 - \beta_{j(i)})\Bigr) + c_{j(k+1)}\Bigl(1 - \sum_{i=1}^{k} p(s_{j(i)})(1 - \beta_{j(i)})\Bigr)$$

Simplifying,

$$\frac{p(s_{j(k)})(1 - \beta_{j(k)})}{c_{j(k)}} \ \ge\ \frac{p(s_{j(k+1)})(1 - \beta_{j(k+1)})}{c_{j(k+1)}} \qquad (35)$$


That is, the optimal sequence satisfies the above ordering relation. To prove the converse, observe that the
inequality

$$\frac{p(s_{j'(k)})(1 - \beta_{j'(k)})}{c_{j'(k)}} \ <\ \frac{p(s_{j(k)})(1 - \beta_{j(k)})}{c_{j(k)}} \qquad (36)$$

implies that $E[J(S^*)] < E[J(S')]$. Therefore, any sequence that is different from $S^*$ can be transformed
to $S^*$ by successive exchanges of neighboring indices, each of which results in a reduction in cost. Hence, the
ordering relation (35) defines an optimal sequence of tests.
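
The index rule can be illustrated with a short sketch. The Python below is only an illustration under the assumptions of this special case (single fault, no false alarms, each test used at most once); the function names and the three-component data are hypothetical and are not taken from the report. It evaluates the cost expression (34) for a given order, sorts the tests by the ratio in (35), and compares the result against a brute-force search over all orderings.

```python
from itertools import permutations

def expected_cost(order, p, beta, c):
    """Expected testing cost of equation (34) for a given test order."""
    total, detected = 0.0, 0.0
    for j in order:
        total += c[j] * (1.0 - detected)    # test j is applied only if no earlier test failed
        detected += p[j] * (1.0 - beta[j])  # probability that test j is the one that fails
    return total

def index_rule(p, beta, c):
    """Order the tests by decreasing p_j (1 - beta_j) / c_j, as in relation (35)."""
    return sorted(range(len(c)), key=lambda j: -p[j] * (1.0 - beta[j]) / c[j])

# Hypothetical data for illustration only.
p = [0.2, 0.5, 0.3]        # fault probabilities
beta = [0.1, 0.3, 0.05]    # missed detection probabilities
c = [1.0, 4.0, 2.0]        # test costs

order = index_rule(p, beta, c)
best = min(permutations(range(len(c))), key=lambda s: expected_cost(s, p, beta, c))
```

For this small instance the sorted order and the exhaustive search agree, which is what the exchange argument predicts.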

4 Near-optimal Test Sequencing using Information Heuristics and Certainty Equivalence


An alternative to DP-based test sequencing algorithms is the class of approximation techniques that
employ greedy heuristics based on information theory. For example, in a one-step lookahead information
heuristic algorithm, if $\{\pi_i(k)\}$ is the current information state at stage k, we select a test $t_{j(k)}$ if it maximizes
the information gain per unit cost of the test. The selection rule is:

$$j(k) = \arg\max_{j \in \{1,2,\ldots,n\}} \left\{ \frac{IG(\{\pi_i(k)\}, t_j)}{c_j} \right\} \qquad (37)$$

where $IG(\{\pi_i(k)\}, t_j)$ is the information gain given by:

$$IG(\{\pi_i(k)\}, t_j) = H(\{\pi_i(k)\}) - H(\{\pi_i(k+1)\} \mid t_j \text{ is applied}) \qquad (38)$$

We can write the expression for the information gain explicitly as:

$$IG(\{\pi_i(k)\}, t_j) = \sum_{i=1}^{m} \pi_i(k)\bigl(d_{ij}\log d_{ij} + (1 - d_{ij})\log(1 - d_{ij})\bigr) - \Bigl(\sum_{i=1}^{m}(1 - d_{ij})\pi_i(k)\Bigr)\log\Bigl(\sum_{i=1}^{m}(1 - d_{ij})\pi_i(k)\Bigr) - \Bigl(\sum_{i=1}^{m} d_{ij}\pi_i(k)\Bigr)\log\Bigl(\sum_{i=1}^{m} d_{ij}\pi_i(k)\Bigr) \qquad (39)$$
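
As a rough illustration of the one-step information heuristic, the following Python sketch evaluates the information gain of a test as the mutual information between the fault state and the test outcome, and picks the test with the largest gain per unit cost. It assumes that `d_col[i]` is the probability that the test fails when fault i is present and that the posteriors sum to one; the function names are hypothetical.

```python
import math

def xlogx(x):
    """x * log(x), with the usual convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0.0 else 0.0

def info_gain(pi, d_col):
    """Information gain of one test: pi[i] is the posterior of fault i,
    d_col[i] is the assumed probability that the test fails given fault i."""
    p_fail = sum(d * p for d, p in zip(d_col, pi))
    p_pass = 1.0 - p_fail
    cond = sum(p * (xlogx(d) + xlogx(1.0 - d)) for d, p in zip(d_col, pi))
    return cond - xlogx(p_pass) - xlogx(p_fail)

def select_test(pi, D, costs):
    """Index of the test with the largest information gain per unit cost.
    D[i][j] is the failure probability of test j under fault i."""
    n = len(costs)
    gains = [info_gain(pi, [row[j] for row in D]) / costs[j] for j in range(n)]
    return max(range(n), key=lambda j: gains[j])
```

With two equally likely faults, a perfect test that separates them yields a gain of log 2, while a test that fails under both faults yields zero gain, so `select_test` prefers the former when the costs are equal.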


Another alternative to DP-based algorithms is the Certainty Equivalence technique. In this approach, we
compute the best test to be applied at every stage, assuming that the tests are reliable, using the AO* algorithm
[7] with the current posterior probabilities of failures in place of the prior probabilities. Once the test result is
known, the posterior probabilities of the failures are updated using (3) and the best test is computed again
as above.

Neither of the above approaches mandates that a test cannot be repeated. Hence, a suitable stopping
criterion is necessary in order to terminate the testing process. One stopping criterion is to compute the
expected cost incurred on applying the chosen test and stop testing at the current stage if the computed
cost is greater than the expected cost at the current stage. Another stopping rule would involve pruning
the ambiguity group at every stage based on the posterior probabilities and stopping when the ambiguity group
of faults contains a single fault. A reasonable pruning rule can be devised by the following consideration:
if the tests are only very slightly imperfect, then after the application of a fairly large number of tests, the
posterior probabilities of non-existent failure states are reduced to a tiny fraction of their prior probabilities
before testing. Hence, a failure source $s_i$ could be removed from the ambiguity group at stage k, if

$$\pi_i(k) < \pi_i(0)/N_p \qquad (40)$$

where $N_p$ is a factor suitably chosen (e.g., $N_p > 100$).
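
The pruning rule (40) amounts to a one-line filter. The sketch below is a minimal illustration; the function name and the default value of `n_p` are assumptions chosen for the example.

```python
def prune_ambiguity_group(group, posterior, prior, n_p=100.0):
    """Drop fault i from the ambiguity group once posterior[i] < prior[i] / n_p."""
    return [i for i in group if posterior[i] >= prior[i] / n_p]
```

Testing would then stop once the returned group contains a single fault.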


5 Implementation of Dynamic Programming Solution 

The sequential testing problem formulated earlier via DP cannot be solved in its original form, since the 
state space (consisting of the posterior probability vector) is continuous. Hence, some form of discretization is 
necessary for the computer implementation of the DP method for this problem. Even with this discretization, 
we will see that problems having more than 20 failure sources cannot be solved optimally owing to the
non-polynomial time complexity of DP. However, DP can serve as a benchmark against which the performance
of near-optimal algorithms can be compared, at least for problems of small size. In the following, we present 
an efficient technique to implement the DP recursion that makes use of “lean” data structures. These data 
structures circumvent the explosive storage requirements of DP, while guaranteeing fast access to states. 

5.1 Outline of the technique

Before we get into the details of state space quantization, let us consider a rough overview of the solution
procedure. Suppose that quantization is already performed and we have a set $X = \{x_i : 1 \le i \le n_s\}$ of
states at hand. Note that every element $x_i$ represents a vector of posterior probabilities of failure sources. For
example, a two failure source problem with $n_s = 3$ uniform quantization levels results in $x_1 = (1.0, 0.0)$,
$x_2 = (0.5, 0.5)$, and $x_3 = (0.0, 1.0)$. Let us also define an appropriate terminal cost function $f(\cdot)$, such that
$f(x_i)$ is the cost incurred if no further testing is carried out at state $x_i$. Note that $f(\cdot)$ depends on the
maintenance/repair philosophy followed. Let $J_k(x_i)$ represent the optimal cost-to-go for state $x_i$ at stage
k (k = 1, 2, ...) of testing. For an N-stage DP problem (i.e., no more than N tests would be used before
diagnosis/repair), by definition,

$$J_N(x_i) = f(x_i) \quad \forall\ 1 \le i \le n_s \qquad (41)$$

Now suppose there are n tests in the system, and it is desired to determine the optimal test to be performed
for every state-stage pair $\{(x_i, k) : 1 \le i \le n_s,\ 1 \le k \le N - 1\}$. Let us define the state-mapping functions for the
n tests, $T_{jp}(x_i)$ and $T_{jf}(x_i)$, $1 \le j \le n$, $1 \le i \le n_s$. The definition of the state-mapping functions is as follows:
when test j is applied at state $x_i$, the pass outcome takes the posterior probability state to $T_{jp}(x_i) \in X$,
and the fail outcome transforms it to $T_{jf}(x_i) \in X$. Let $P_{jp}(x_i)$ and $P_{jf}(x_i)$ be the associated probabilities
of these events conditioned on state $x_i$.

The recursive DP formulation of (4) can be adapted to the above quantized version as follows:

$$J_k(x_i) = \min_{1 \le j \le n} \bigl\{ c_j + P_{jp}(x_i) J_{k+1}(T_{jp}(x_i)) + P_{jf}(x_i) J_{k+1}(T_{jf}(x_i)) \bigr\}, \quad 1 \le i \le n_s,\ 1 \le k \le N - 1 \qquad (42)$$

The index j that attains the minimum in the above recursion is the best test to apply at stage k. Thus, we initialize this
recursion at k = N with,

$$J_N(x_i) = f(x_i), \quad 1 \le i \le n_s \qquad (43)$$

and carry through backwards from stage k = N - 1 to k = 1.
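
The backward recursion can be sketched as follows. This Python fragment is only an illustration of the table-driven form, in which the state-mapping functions and pass probabilities are assumed to be precomputed as index arrays; the names `backward_dp`, `p_pass`, `t_pass`, and `t_fail` are hypothetical and do not come from the report.

```python
def backward_dp(terminal_cost, test_costs, p_pass, t_pass, t_fail, n_stages):
    """Backward recursion over states indexed 0..n_s-1.

    p_pass[j][i]  : probability that test j passes in state i
    t_pass[j][i]  : index of the successor state after a pass
    t_fail[j][i]  : index of the successor state after a fail
    Returns the stage-1 cost-to-go and the per-stage policy tables.
    """
    n_s, n = len(terminal_cost), len(test_costs)
    J = list(terminal_cost)                  # cost-to-go at the final stage N
    policy = []
    for _ in range(n_stages - 1):            # stages N-1 down to 1
        J_new, best = [], []
        for i in range(n_s):
            costs = [test_costs[j]
                     + p_pass[j][i] * J[t_pass[j][i]]
                     + (1.0 - p_pass[j][i]) * J[t_fail[j][i]]
                     for j in range(n)]
            j_star = min(range(n), key=costs.__getitem__)
            J_new.append(costs[j_star])
            best.append(j_star)
        J = J_new
        policy.insert(0, best)               # policy[0] corresponds to stage 1
    return J, policy
```

Precomputing the mapping tables keeps the sketch short, but it is exactly the storage-hungry variant; computing the successor states on the fly trades that storage for extra arithmetic per state.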

However, the computer implementation of this recursion requires consideration of the following important 
issues that directly affect the size of the problems that can be solved: 

1. Quantization Scheme: We need to determine the optimal quantization scheme to map floating point 
probabilities (that can lie anywhere in [0,1]) to discrete levels. We will see that any simplistic rule to 
quantize the probabilities may result in quantization levels that do not map valid probability states. 

2. State Space Data Structures: If storage is not an issue, then the above recursions can be solved
very easily by precomputing the mapping functions $T_{jp}(x_i)$, $T_{jf}(x_i)$, $1 \le j \le n$, $1 \le i \le n_s$. However,
we will see that the storage requirements are prohibitively high for even small problems with not too
many quantization levels. Hence, we need to determine efficient ways of storing the discrete probability
states and computing the test-mapping functions on the fly.

In the following, we consider the above issues and present effective solutions that let us push the envelope 
in solving such an intractable problem. 

5.2 Quantization 

The problem of probability state quantization is formulated as follows. Consider a posterior probability
state space $P^m$ of m dimensions. That is, a valid state $p \in P^m$ is a vector of m elements $\{p_1, p_2, \ldots, p_m\}$
such that,

$$\sum_{i=1}^{m} p_i = 1.0, \qquad 0 \le p_i \le 1 \quad \forall\ 1 \le i \le m \qquad (44)$$

Suppose we want to uniformly divide the interval [0, 1] into $n_q$ divisions, i.e., we ordain that the only valid
probabilities are $\{0, \delta, 2\delta, \ldots, n_q\delta\}$, where $\delta = 1/n_q$ is the quantization interval, and $n_q + 1$ is the number
of quantization levels. For a specified $n_q$, the objective is to determine a set of m non-negative integers
$\{q_1, q_2, \ldots, q_m\}$ such that the vector $\{q_1\delta, q_2\delta, \ldots, q_m\delta\}$ represents the quantized probability state.
Clearly, it is necessary to have,

$$\sum_{i=1}^{m} q_i = n_q \qquad (45)$$

Note that various simplistic scalar quantization rules such as $\{q_i = \lceil p_i/\delta \rceil\}$ or $\{q_i = \lfloor p_i/\delta \rfloor\}$, or even
$\{q_i = \lfloor p_i/\delta + 0.5 \rfloor\}$, will result in quantized states that do not satisfy (45) for most choices of $n_q$. Hence,
we need to devise a vector quantization scheme that transforms any given probability state to a valid
quantized probability state. A suitable criterion to choose the integers $\{q_1, q_2, \ldots, q_m\}$ could be to minimize
the Euclidean distance between the quantized and unquantized probability states.

Formally, the optimal choice of the quantization vector $q = \{q_1, q_2, \ldots, q_m\}$ minimizes the Euclidean
distance measure between the unquantized and quantized probability states defined by,

$$d(q) = \sum_{i=1}^{m} (p_i - q_i\delta)^2 \qquad (46)$$

subject to the constraint,

$$\sum_{i=1}^{m} q_i = n_q \qquad (47)$$

This is a resource allocation problem with a quadratic cost function, which has a well-known optimal solution
procedure via a greedy approach [5]. This approach starts by assigning zeros to all $q_i$ and then increments one $q_i$
at a time by 1, each time choosing the one that results in the maximum decrease of the cost function in (46). However, a direct application
of this algorithm requires $m n_q$ computations of cost function decrements ($m n_q$ multiplications). In the
following, we present a technique that converges to the optimal solution requiring at most $m^2$ computations
of cost function decrements and m divisions. Our technique results in substantial computational savings for
large values of $n_q$.

The basic idea involved in our technique is to compute a fast, but accurate first estimate of the quantization
levels, and then use the greedy algorithm from that point on, instead of starting from an all-zero
q vector. With this in mind, let us now consider the following version of the above problem with a tighter
constraint set:

$$\text{Minimize } d(q) = \sum_{i=1}^{m} (p_i - q_i\delta)^2 \qquad (48)$$

subject to the constraint,

$$q_i\delta \le p_i \quad \forall\ 1 \le i \le m, \qquad 0 \le q_i \qquad (49)$$

Suppose the solution to the above version is given by $\hat{q} = \{\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_m\}$. The original problem can be
reformulated in terms of this partial solution as follows:

$$\text{Minimize } d(r) = \sum_{i=1}^{m} (p_i - \hat{q}_i\delta - r_i\delta)^2 \qquad (50)$$

subject to the constraint,

$$\sum_{i=1}^{m} r_i = n_q - \sum_{i=1}^{m} \hat{q}_i \qquad (51)$$

If $\{r_i\}$ are constrained to be non-negative, then an appropriate change of variables results in the same resource
allocation problem as in (46), but with a reduced resource constraint. It can be easily shown that the
resource constraint $n_q - \sum_i \hat{q}_i$ in the reduced problem can never exceed m. Then, a quick solution of the
problem in (48) would reduce the number of cost function computations from $m n_q$ to $m^2$. In the following,
we present the optimal solution to the modified problem and show that $\{r_i\}$ are all non-negative for the optimal
solution, allowing us to use the greedy approach to solve the reduced resource allocation problem.


Lemma 1 The optimal solution to the modified problem in (48) is given by,

$$\hat{q}_i = \lfloor p_i/\delta \rfloor \quad \forall\ 1 \le i \le m \qquad (52)$$

Proof: Clearly $\{\hat{q}_i\}$ is feasible. Increasing any $\hat{q}_i$ results in an infeasible solution. Decreasing any
$\hat{q}_i$ results in a cost increase. Hence, $\{\hat{q}_i\}$ is an optimal feasible solution to the problem in (48) and (49).

Lemma 2 The resource variables of the reduced problem in (50) are non-negative for the optimal
solution.

Proof: It suffices to show that the optimal solution vector q satisfies

$$q_i \ge \hat{q}_i \quad \forall\ 1 \le i \le m \qquad (53)$$

Suppose this is not true, and that for some k, $q_k = \hat{q}_k - 1$. Since the elements in q satisfy (45) and since
the elements in $\hat{q}$ sum to an integer less than or equal to $n_q$, there must exist some index l such that
$q_l = \hat{q}_l + 1$. Assume without loss of generality that l = k + 1. Now, consider an alternative quantization
vector $q^a$ where,

$$q^a_i = \begin{cases} \hat{q}_k & i = k \\ \hat{q}_{k+1} & i = k + 1 \\ q_i & \text{otherwise} \end{cases} \qquad (54)$$

The difference between the cost functions induced by the above two quantization vectors can be written as,

$$\begin{aligned} d(q) - d(q^a) &= (p_k - \hat{q}_k\delta + \delta)^2 + (p_{k+1} - \hat{q}_{k+1}\delta - \delta)^2 - (p_k - \hat{q}_k\delta)^2 - (p_{k+1} - \hat{q}_{k+1}\delta)^2 \\ &= 2\delta\bigl[(p_k - \hat{q}_k\delta) + ((\hat{q}_{k+1} + 1)\delta - p_{k+1})\bigr] \\ &> 0 \end{aligned} \qquad (55)$$

The final inequality in the above equation follows directly from the definition $\hat{q}_i = \lfloor p_i/\delta \rfloor$, implying $p_i \ge \hat{q}_i\delta$
and $(\hat{q}_i + 1)\delta > p_i$. Thus we see that a solution q violating the statement of the lemma cannot be optimal.
Hence, it follows that the optimal solution always contains $\hat{q}$, thereby forcing the variables $r_i$ in (50) to be
non-negative.
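
The two-phase quantizer suggested by Lemmas 1 and 2 can be sketched in a few lines. The Python below is a minimal illustration, assuming that the input probabilities sum to one and ignoring floating-point edge cases near exact multiples of the quantization interval; the function name `quantize` is hypothetical.

```python
import math

def quantize(p, n_q):
    """Floor each p_i / delta, then greedily hand out the remaining units."""
    delta = 1.0 / n_q
    q = [math.floor(p_i * n_q) for p_i in p]      # first estimate: floor of p_i / delta
    for _ in range(n_q - sum(q)):                 # fewer than m units remain
        # Raising q_i by one changes the cost by delta^2 - 2*delta*(p_i - q_i*delta),
        # so the coordinate with the largest residual gives the largest decrease.
        i = max(range(len(p)), key=lambda k: p[k] - q[k] * delta)
        q[i] += 1
    return q
```

For example, with p = (1/3, 1/3, 1/3) and n_q = 4, rounding each coordinate independently gives levels that sum to 3, whereas the sketch always returns levels that sum to n_q.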

It is instructive to determine the total number of distinct discrete probability states resulting from such
a quantization scheme. This can be formally written as the number of distinct solutions in non-negative
integers for the following equation:

$$\sum_{i=1}^{m} q_i = n_q \qquad (56)$$

Lemma 3 The total number of distinct discrete probability states arising out of quantization of an
m-dimensional probability space (m failure sources) into $n_q$ divisions along each probability coordinate is given
by $\binom{n_q + m - 1}{m - 1}$.

Proof: Consider a line segment of length $n_q$, with points $P_0, P_1, \ldots, P_{n_q}$ marked out at integer intervals.
Any solution (in positive integers) of (56) corresponds to a decomposition of this segment into m pieces
whose lengths are positive integers. The m - 1 end points of these pieces (other than $P_0$ and $P_{n_q}$) must be
chosen from among the $n_q - 1$ points $P_1, P_2, \ldots, P_{n_q - 1}$. This can be done in $\binom{n_q - 1}{m - 1}$ ways. However, note
that we are looking for all non-negative solutions of the problem. Adding m to both sides of (56), we get

$$\sum_{i=1}^{m} (q_i + 1) = n_q + m \qquad (57)$$

Now the variables $y_i = q_i + 1$ are strictly positive if the $q_i$ are non-negative, and there are $\binom{n_q + m - 1}{m - 1}$ ways of
choosing strictly positive $y_i$ variables. Thus, the number of non-negative integral solutions is identical, which proves the lemma.
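
The count of quantized states can be checked numerically for small cases. The following Python sketch enumerates all vectors of m non-negative integers that sum to n_q and compares the total with the binomial coefficient given by the standard stars-and-bars argument; the generator name is hypothetical.

```python
from math import comb

def enumerate_states(m, n_q):
    """Yield every tuple of m non-negative integers that sums to n_q."""
    if m == 1:
        yield (n_q,)
        return
    for first in range(n_q + 1):
        for rest in enumerate_states(m - 1, n_q - first):
            yield (first,) + rest

states = list(enumerate_states(4, 3))          # m = 4 failure sources, n_q = 3 divisions
assert len(states) == comb(3 + 4 - 1, 4 - 1)   # binomial(n_q + m - 1, m - 1)
```

The enumeration is lexicographic, so the states come out in an order that groups vectors sharing a common prefix.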

5.3 Data Structures for State Space Representation

In the previous section, we presented the quantization procedure for the discretization of the posterior
probability space. The DP recursion described in (42) requires the following fundamental operations to be
performed repeatedly:

• Given a quantized state $x_i$, compute the resulting states $T_{jp}(x_i)$ and $T_{jf}(x_i)$ due to pass and fail
outcomes of an admissible test j.

• Access the cost-to-go estimates at the states $T_{jp}(x_i)$ and $T_{jf}(x_i)$ obtained in the earlier cycle of
computation and revise the cost-to-go estimate at $x_i$.

A naive approach to address the above operations is to precompute the mapping functions $T_{jp}(\cdot)$ and
$T_{jf}(\cdot)$ and store the appropriate pointers in each $x_i$ so that the $T_{jp}(x_i)$ and $T_{jf}(x_i)$ states can be accessed
directly from $x_i$ for any given test j. This requires an extra storage of $n\binom{n_q + m - 1}{m - 1}$ pointer variables (which
require 4 bytes each on most computer systems), where $n_q$ is the number of quantization divisions of each
probability coordinate, m is the number of failure sources and n is the number of tests. Clearly, a runtime
calculation of these mapping functions, together with efficient data structures that enable fast access to the transformed
states, would free up enough memory to solve problems of much larger dimension
than is possible with the above simplistic approach.

However, this approach requires us to devise methods to:

• enumerate and store the quantized states in efficient data structures.

• access the cost-to-go for a given state.

These are no simple tasks, since a simplistic table storage of states (each state is a collection of m integers)
takes up $\binom{n_q + m - 1}{m - 1}\, m \lceil \log_{10} n_q \rceil$ bytes of memory space on conventional computer systems (assuming that the
integers are concatenated to form a string). Moreover, random access of a state in such a table requires an average
of $\binom{n_q + m - 1}{m - 1}/2$ comparisons.

In the following, we present a highly storage-efficient, fast-access data structure tuned for this purpose.
We first need to introduce some notation in order to give a formal description of the data structures involved.
Consider a directed graph T = (V, E) where V is the set of vertices (nodes) and E is the set of edges. In
addition, let T be a directed rooted tree having one vertex which is the head of no edges (called the root),
with each vertex except the root being the head of exactly one edge. The relation "(v, w) is an edge of T" is denoted
by v → w. If v → w, then v is termed the parent of w and w is a child of v. Let the function d(v) represent
the depth of the node v in the rooted tree.

In order to illustrate why a rooted tree is chosen to represent the set of discretized probabilities, let us
consider an example system of m = 4 failure sources and $n_q = 3$ quantization intervals. We then obtain the
following quantization vectors:

q_1   q_2   q_3,q_4
0     0     0,3
            1,2
            2,1
            3,0
      1     0,2
            1,1
            2,0
      2     0,1
            1,0
      3     0,0
1     0     0,2
            1,1
            2,0
      1     0,1
            1,0
      2     0,0
2     0     0,1
            1,0
      1     0,0
3     0     0,0

Blanks are used whenever $q_i$ remained unchanged from its previous value in order to bring out the
similarity of the enumerated state space to a rooted tree. Also, note that $q_3$ and $q_4$ are intentionally bunched
together, since the last coordinate (in this case $q_4$) is fixed when the first m - 1 coordinates are defined;
hence its storage can be eliminated. By placing a node at every non-blank entry in the above table and
connecting nodes from left to right, i.e., $q_1$ nodes to $q_2$ nodes, $q_2$ nodes to $q_3$ nodes, we can form a directed
rooted tree, where every node is a child of just one parent. The nodes in the first layer ($q_1$ nodes) can be
assumed to be emanating from a single dummy node $q_0$ for the sake of completeness.


The elemental data structures for representing the above rooted tree and the pseudo-code for
the associated state-access routines are described in Appendix A.

The total number of nodes in such a rooted tree structure is bounded by $2\binom{n_q + m - 1}{m - 1}$, and the combined
memory requirement for a DP scheme utilizing these data structures is no more than $7\binom{n_q + m - 1}{m - 1}$ bytes on
most conventional computer systems. In addition to being inexpensive in terms of storage, note that the
access to the cost and policy corresponding to a given quantized state does not take more than m - 1
operations, making it attractive for runtime computation of test mapping functions.
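
A rough sketch of the rooted-tree idea is given below. It is not the structure of Appendix A: for brevity it uses nested Python dictionaries instead of child arrays, and the function names are hypothetical. Each level of the tree stores one coordinate of the quantized state, the last coordinate is left implicit, and the leaves hold an index into a cost vector, so a lookup walks m - 1 levels. The sketch assumes m is at least 2.

```python
def build_state_tree(m, n_q):
    """Build a tree over (q_1, ..., q_{m-1}); leaves store indices into the cost vector."""
    root, count = {}, 0

    def fill(node, depth, remaining):
        nonlocal count
        for q in range(remaining + 1):
            if depth == m - 1:          # last stored coordinate; q_m is implied by the sum
                node[q] = count
                count += 1
            else:
                node[q] = {}
                fill(node[q], depth + 1, remaining - q)

    fill(root, 1, n_q)
    return root, count

def lookup(root, state):
    """Return the cost-vector index of a quantized state in m - 1 steps."""
    node = root
    for q in state[:-1]:
        node = node[q]
    return node
```

For the example with m = 4 and n_q = 3, the tree has 20 leaves, and the state (0, 0, 0, 3) maps to index 0, matching the first row of the enumeration.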

5.4 Terminal Cost Function 


As defined earlier, the terminal cost function is the probabilistic cost incurred when the testing is stopped 
at a given quantized probability state. We define the following cost function for our DP implementation: 

$$f(q) = \Bigl(1 - \frac{q(i')}{n_q}\Bigr) C_{F i'} + \sum_{i=1,\ i \ne i'}^{m} C_{M i}\, \frac{q(i)}{n_q} \qquad (58)$$

where

$$i' = \arg\max_{i}\, q(i) \qquad (59)$$

This definition corresponds to repairing the component with the highest posterior probability value. It makes 
sense to choose this terminal cost because of the following reason: when the test costs are significantly lower 
than the false repair costs and missed repair costs (which is usually the case in practice), then there should 
be an incentive to apply another test and skew the probability distribution to reduce the entropy of the state. 
For instance, the uncertainty in the state (0.1, 0.9) is less than that of the state (0.2, 0.8) and hence should 
have a lower terminal cost. However, the terminal cost difference between the states (0.2, 0.8) and (0.21,0.79) 
should not be sizable. The above definition conforms to this principle and also gives us a consistent stopping 
rule: testing should be stopped when the average cost incurred after applying any test is higher than the 
cost of stopping at the present state. 
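
The behavior described above can be checked with a small sketch. The Python below assumes one particular reading of the terminal cost: repairing the most probable component incurs its false repair cost weighted by the probability that it is not faulty, plus the missed repair costs of the remaining components weighted by their probabilities. The function and parameter names are hypothetical.

```python
def terminal_cost(q, n_q, false_repair, missed_repair):
    """Terminal cost of a quantized state q (integer levels summing to n_q),
    under the assumed reading of the cost definition described in the lead-in."""
    i_star = max(range(len(q)), key=lambda i: q[i])        # component chosen for repair
    cost = (1.0 - q[i_star] / n_q) * false_repair[i_star]  # repaired but not faulty
    cost += sum(missed_repair[i] * q[i] / n_q              # faulty but not repaired
                for i in range(len(q)) if i != i_star)
    return cost
```

With all repair costs equal to 100, the state (0.1, 0.9) costs about 20 and the state (0.2, 0.8) costs about 40, consistent with the requirement that lower uncertainty gives a lower terminal cost.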

5.5 Simulation Results

5.5.1 Comparison with DP

In order to compare the performance of the information heuristics and certainty equivalence with DP, we
considered two small systems: one with 3 failures and 3 tests and another with 5 failures and 5 tests. For
these systems, it is possible to quantize the posterior probabilities into very small intervals, thus resulting
in an accurate implementation of the Dynamic Programming recursion. Specifically, the two systems we
considered are described below:


Table 1: Parameters of System 1 
Number of Faults = 3 
Number of Tests = 3 

Number of DP Quantization Levels = 500 
D-Matrix 
1 1 0 
0 1 1 
0 0 1 

Test Costs 1.0 1.0 1.0 
False Repair Costs 100.0 100.0 100.0 
Missed Repair Costs 100.0 100.0 100.0 
Prior Probs of Faults 0.25 0.35 0.4 


Table 2: Parameters of System 2 

Number of Faults = 5 
Number of Tests = 5 
Number of DP Quantization Levels = 40 
D-Matrix 
1 1 0 0 0
0 1 1 0 0
0 0 1 1 0
0 0 0 1 1
0 0 0 0 1

Test Costs 1.0 1.0 1.0 1.0 1.0
False Repair Costs 100.0 100.0 100.0 100.0 100.0
Missed Repair Costs 100.0 100.0 100.0 100.0 100.0 
Prior Probs of Faults 0.25 0.2 0.3 0.15 0.1 


Tables 3-6 show the comparative performance of various algorithms (multi-step look-ahead DP, multi-step
information heuristics and Certainty Equivalence) for System 1 for various values of test unreliabilities.
Tables 7-10 show the comparative performance of these algorithms for System 2. Tables 11-12 show the
comparative performance of the information heuristic and certainty equivalence techniques for Graham and
Garey's pathological example [6] with m = 10. Note that INFO(k) denotes information heuristics with k-step
look-ahead, DP(k) denotes dynamic programming with k-step look-ahead, and CE denotes the Certainty
Equivalence technique. It is observed that for low values of test unreliabilities, the heuristic techniques
resulted in near-optimal solutions. However, their performance degrades as the probabilities of false alarm
and missed detection are increased. Also, it is interesting to note that there is not much difference between
the performances of the info-heuristic technique and the certainty equivalence approach for Systems 1 and 2 for
low values of test unreliabilities. Another interesting observation is that CE resulted in consistently lower
probability of error compared to information heuristics. However, for the worst case example of Graham
and Garey, CE was always better than the information heuristics.

Metrics              DP(1)        DP(3)        DP(7)        DP(15)
Ave. Test Length     2.29521      2.93849      2.94194      2.9416
Ave. Testing Cost    11.3974      7.88139      7.48886      7.1705
Prob. of Error       0.0455903    0.0247351    0.0227501    0.021155
Ave. Info Gain       0.678008     0.533618     0.534378     0.535792

Table 3: Comparison of Various DP Methods for Pf=0.05, Pm=0.05 (System 1)


Metrics              INFO(1)      INFO(2)      INFO(3)      CE
Ave. Test Length     2.29273      2.30669      2.54352      2.29553
Ave. Testing Cost    11.515       11.1669      11.4625      11.3658
Prob. of Error       0.0461903    0.0443203    0.0446153    0.0454103
Ave. Info Gain       0.678128     0.677073     0.649883     0.677529

Table 4: Comparison of Various Heuristic Methods for Pf=0.05, Pm=0.05 (System 1)


Metrics              DP(1)        DP(3)        DP(7)        DP(15)
Ave. Test Length     2.96716      3.63474      3.63336      4.10859
Ave. Testing Cost    17.9693      8.04204      7.95276      8.39895
Prob. of Error       0.0750394    0.02205      0.0216099    0.0214649
Ave. Info Gain       0.392182     0.451167     0.45221      0.433815

Table 5: Comparison of Various DP Methods for Pf=0.10, Pm=0.10 (System 1)


Metrics              INFO(1)      INFO(2)      INFO(3)      CE
Ave. Test Length     3.07231      3.08141      3.09564      3.08723
Ave. Testing Cost    12.0591      12.6102      12.3495      12.6811
Prob. of Error       0.0449601    0.0476701    0.0462951    0.0479951
Ave. Info Gain       0.456213     0.45486      0.453465     0.453524

Table 6: Comparison of Various Heuristic Methods for Pf=0.10, Pm=0.10 (System 1)


Metrics              DP(1)        DP(3)        DP(7)        DP(15)
Ave. Test Length     2.41679      3.38965      3.53795      3.52495
Ave. Testing Cost    21.4411      13.1638      13.4766      13.8521
Prob. of Error       0.0952055    0.0490145    0.0497751    0.0517699
Ave. Info Gain       0.651796     0.563471     0.562667     0.562444

Table 7: Comparison of Various DP Methods for Pf=0.05, Pm=0.05 (System 2)


Metrics              INFO(1)      INFO(2)      INFO(3)      CE
Ave. Test Length     2.56769      2.56358      2.56213      4.41764
Ave. Testing Cost    19.4079      18.9453      19.1493      10.6517
Prob. of Error       0.0842089    0.0819036    0.0829637    0.0312653
Ave. Info Gain       0.667016     0.669341     0.668698     0.519862

Table 8: Comparison of Various Heuristic Methods for Pf=0.05, Pm=0.05 (System 2)


Metrics              DP(1)        DP(3)        DP(7)        DP(15)
Ave. Test Length     3.7586       4.23867      4.30183      6.97485
Ave. Testing Cost    24.868       17.8838      21.7445      16.1277
Prob. of Error       0.105529     0.0682713    0.0872903    0.0458495
Ave. Info Gain       0.426733     0.440533     0.482891     0.285431

Table 9: Comparison of Various DP Methods for Pf=0.10, Pm=0.10 (System 2)

Metrics              INFO(1)      INFO(2)      INFO(3)      CE
Ave. Test Length     3.09077      3.0923       3.08582      5.10162
Ave. Testing Cost    29.0598      29.1433      28.8392      19.0721
Prob. of Error       0.129892     0.130332     0.128832     0.0698944
Ave. Info Gain       0.48225      0.482189     0.483567     0.405789

Table 10: Comparison of Various Heuristic Methods for Pf=0.10, Pm=0.10 (System 2)


Metrics              INFO(1)      INFO(2)      INFO(3)      CE
Ave. Test Length     17.5606      17.5607      18.7088      15.1227
Ave. Testing Cost    17.5606      17.5607      18.7088      15.1227
Prob. of Error       0            0            0            0
Ave. Info Gain       0.0241986    0.0241994    0.0226679    0.0282088

Table 11: Comparison of Various Methods for Pf=0.10, Pm=0.10 (Garey's model with m=10)

Metrics              INFO(1)      INFO(2)      INFO(3)      CE
Ave. Test Length     16.4974      16.4957      17.4856      9.92901
Ave. Testing Cost    16.4974      16.4957      17.4856      9.92901
Prob. of Error       0            0            0            0
Ave. Info Gain       0.0257201    0.0257262    0.0242757    0.0424652

Table 12: Comparison of Various Methods for Pf=0.05, Pm=0.05 (Garey's model with m=10)

6 Top-Down Graph Search Algorithms 


The top-down algorithms described in [7] can be readily applied even when the tests are imperfect. This is
because the HEFs (required for AO*-based algorithms) and the information gain expressions depend only on
the posterior probability distribution of the failure sources at the current ambiguity node. These posterior
probabilities can be computed via the Bayes rule given in (3). However, we found that the AO*-based
algorithms are not useful due to the explosion of the diagnostic strategy even for moderately sized systems.
On the other hand, the top-down information heuristic algorithms, coupled with the ambiguity pruning
technique described earlier, enabled us to solve large systems. Tables 13-18 demonstrate the performance
of the top-down information heuristic algorithm for various randomly generated systems of different sizes and
for various values of false alarm and missed detection probabilities of tests. Note that $\alpha$ denotes the false
alarm probability, and $\beta$ denotes the missed detection probability. The following performance indicators
were collected and listed in these tables:

• $J_c$ is the expected testing cost

• $J_r$ is the expected repair cost composed of the missed repair and false repair costs

• $J_n$ is the average ambiguity group size

• $n_l$ is the number of leaf nodes in the diagnostic strategy

• $n_e$ is the total number of nodes in the decision tree

We can see that even the slightest uncertainty in the test outcomes results in large diagnostic trees with
increased testing and repair costs, albeit with tolerable values of average ambiguity group sizes. Table 19
shows the performance of the top-down information heuristic algorithm for random systems of various sizes
with fixed test uncertainties ($\alpha = 0.01$, $\beta = 0.01$). We can see that a system containing as many as 2000
failures and 2000 imperfect tests is solved in less than 30 minutes.

(α,β)          Jc       Jr       Jn       n_l     n_e     Time(secs)
(0.00,0.00)    6.468    0.000    1.000    100     199     0.29
(0.01,0.00)    8.047    0.093    1.002    433     865     1.35
(0.02,0.00)    8.184    0.013    1.000    452     903     1.36
(0.03,0.00)    8.197    0.010    1.000    448     895     1.35
(0.04,0.00)    9.608    1.236    1.029    938     1875    3.26
(0.05,0.00)    9.722    1.678    1.035    965     1929    3.31

Table 13: Performance of Top-Down Algorithm for a (100,100) system with false alarms only


(α,β)          Jc       Jr       Jn       n_l     n_e     Time(secs)
(0.00,0.00)    6.468    0.000    1.000    100     199     0.29
(0.00,0.01)    7.893    0.084    1.001    424     847     1.33
(0.00,0.02)    8.000    0.003    1.000    439     877     1.32
(0.00,0.03)    8.055    0.002    1.000    437     873     1.33
(0.00,0.04)    9.275    1.231    1.030    873     1745    3.08
(0.00,0.05)    9.391    1.582    1.033    919     1837    3.19

Table 14: Performance of Top-Down Algorithm for a (100,100) system with missed detections only


(α,β)          Jc        Jr        Jn       n_l     n_e     Time(secs)
(0.00,0.00)    6.468     0.000     1.000    100     199     0.29
(0.01,0.01)    9.013     0.274     1.005    889     1777    2.95
(0.02,0.02)    9.129     0.037     1.001    947     1893    3.01
(0.03,0.03)    9.155     0.025     1.000    952     1903    3.03
(0.04,0.04)    11.410    8.603     1.266    2650    5299    13.40
(0.05,0.05)    11.434    10.964    1.286    2908    5815    14.27

Table 15: Performance of Top-Down Algorithm for a (100,100) system with false alarms and missed detections



(α,β)          Jc        Jr       Jn       n_l     n_e     Time(secs)
(0.00,0.00)    7.417     0.000    1.000    200     399     1.26
(0.01,0.00)    8.677     0.906    1.016    785     1569    6.43
(0.02,0.00)    8.874     0.311    1.005    926     1851    6.64
(0.03,0.00)    8.984     0.227    1.003    953     1905    6.67
(0.04,0.00)    10.157    2.159    1.073    1874    3747    16.39
(0.05,0.00)    10.208    3.006    1.088    1910    3819    16.63

Table 16: Performance of Top-Down Algorithm for a (200,200) system with false alarms only


(α,β)          Jc        Jr       Jn       n_l     n_e     Time(secs)
(0.00,0.00)    7.417     0.000    1.000    200     399     1.27
(0.00,0.01)    8.640     0.841    1.015    763     1525    6.08
(0.00,0.02)    8.883     0.333    1.006    886     1771    6.27
(0.00,0.03)    9.011     0.169    1.003    924     1847    6.34
(0.00,0.04)    10.235    2.029    1.068    1790    3579    15.46
(0.00,0.05)    10.284    2.744    1.079    1848    3695    15.65

Table 17: Performance of Top-Down Algorithm for a (200,200) system with missed detections only

(α,β)          Jc        Jr        Jn       n_l     n_e      Time(secs)
(0.00,0.00)    7.417     0.000     1.000    200     399      1.26
(0.01,0.01)    9.518     2.538     1.048    1462    2923     13.39
(0.02,0.02)    9.917     0.859     1.015    1907    3813     14.28
(0.04,0.04)    12.062    11.625    1.540    4839    9677     64.50
(0.05,0.05)    12.122    16.071    1.635    5148    10295    66.62

Table 18: Performance of Top-Down Algorithm for a (200,200) system with false alarms and missed detections


m, n     n_t     n_e    Time (secs)
 500    2069    4137      90.26
1000    3094    6187     378.89
1500    4136    8271     891.50
2000    5267   10533    1653.06


Table 19: Performance of Top-Down Algorithm for systems of various sizes with α = 0.01, β = 0.01

7 Summary 

In this paper, we considered the problem of test sequencing in the presence of imperfect tests. The test
sequencing problem is a partially observed Markov decision problem (POMDP): a sequential multi-stage
decision problem in which the states are the probabilities of the possible failure sources, and information
about the states is obtained from the outcomes of imperfect tests. The optimal solution to this problem can be
obtained by applying a continuous-state Dynamic Programming (DP) recursion. However, the DP recursion is
computationally very expensive because the state vector, which comprises the fault probabilities, is
continuous. To alleviate this computational explosion, we presented an efficient approach for implementing
the DP recursion for this problem. In addition, we presented multi-step DP, multi-step information heuristic,
and certainty equivalence algorithms for interactive diagnosis of systems with imperfect tests. We also
considered various problems with special structure (parallel systems) and derived closed-form solutions and
index rules without having to resort to DP. Finally, we presented computational results demonstrating the
effectiveness of the information-heuristic-based top-down graph search algorithm.
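
To make the belief-state view concrete, the following is a minimal sketch of how the fault probabilities could be updated after one imperfect test outcome. It is an illustration under stated assumptions, not the paper's implementation: it assumes a single test with a binary detection signature, takes alpha as the false alarm probability and beta as the missed detection probability (the ordering suggested by the tables above), and the function name `update_belief` and the example numbers are hypothetical.

```python
from typing import List


def update_belief(belief: List[float], detects: List[bool],
                  test_failed: bool, alpha: float, beta: float) -> List[float]:
    """Bayes update of the hypothesis probabilities after one test outcome.

    belief[i]  : prior probability that hypothesis i is the true system state
    detects[i] : True if the test is designed to detect hypothesis i
    alpha      : assumed false alarm probability (fails although it should pass)
    beta       : assumed missed detection probability (passes although it should fail)
    """
    posterior = []
    for prior, covered in zip(belief, detects):
        # Probability that the test fails if this hypothesis is true.
        p_fail = (1.0 - beta) if covered else alpha
        likelihood = p_fail if test_failed else 1.0 - p_fail
        posterior.append(prior * likelihood)
    total = sum(posterior)
    if total == 0.0:
        raise ValueError("observed outcome has zero probability under the prior")
    return [p / total for p in posterior]


# Hypothetical example: three hypotheses, the test covers only the second one.
prior = [0.5, 0.3, 0.2]
detects = [False, True, False]
post = update_belief(prior, detects, test_failed=True, alpha=0.05, beta=0.05)
```

With perfect tests (alpha = beta = 0) a failed test would place all probability on the covered hypotheses; with imperfect tests the posterior stays spread over every hypothesis, which is why the state of the decision problem is a continuous probability vector.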

A Data Structures and Pseudo-code for DP Implementation 


RootedTreeNode

{

    NumberOfChildNodes (Integer)
    ArrayOfChildNodes (Pointer to RootedTreeNode)
    IndexIntoCostVector (Integer)

}

Data Structure to Represent a Node in the Rooted Tree 


CostVectorNode

{

    EstimateOfCostToGo (Floating Point Variable)
    Policy (Integer)

}

Data Structure to Represent a Node in the Cost Vector 
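
As a rough illustration, the two records above could be expressed in Python as follows. This is a sketch only: the snake_case field names, the use of -1 as an "unset" marker, and the tiny hand-built example are assumptions made for illustration, not part of the original data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RootedTreeNode:
    """Node of the rooted tree that indexes quantized states."""
    children: List["RootedTreeNode"] = field(default_factory=list)  # ArrayOfChildNodes
    index_into_cost_vector: int = -1  # used at leaves; -1 marks "unset" (assumption)

    @property
    def number_of_child_nodes(self) -> int:
        # NumberOfChildNodes is derived from the list instead of stored separately.
        return len(self.children)


@dataclass
class CostVectorNode:
    """One entry of the cost vector."""
    estimate_of_cost_to_go: float = 0.0
    policy: int = -1  # -1 marks "not yet computed" (assumption)


# Hand-built example: a root with two leaves that share a two-entry cost vector.
root = RootedTreeNode(children=[RootedTreeNode(index_into_cost_vector=0),
                                RootedTreeNode(index_into_cost_vector=1)])
cost_vector = [CostVectorNode(1.5, 0), CostVectorNode(2.0, 3)]
leaf = root.children[1]
entry = cost_vector[leaf.index_into_cost_vector]
```

Keeping the costs in a flat vector and storing only an integer index at each leaf separates the tree, which is used purely for addressing, from the values that the DP recursion updates.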


Procedure RootedStateTreeConstructor
Inputs:

RootedTreeNode CurrentNode By Reference
Integer SumTillNow By Value
Integer FaultNum By Value
Integer StateIndex By Reference
CostVectorArray CostVector By Reference
Integer NumLevels By Value
IntegerArray CurrentState By Reference

{

    if (FaultNum = NumFaults-1)

    {

        (leaf node: store the terminal cost and record this state's index into the cost vector)

        CostVector[StateIndex].EstimateOfCostToGo = TerminalCostFunction(CurrentState)

        CurrentNode.IndexIntoCostVector = StateIndex

        StateIndex = StateIndex+1

        return

    }

    CurrentNode.NumberOfChildNodes = NumLevels-SumTillNow+1

    (create storage for child nodes too)

    for i=1 to CurrentNode.NumberOfChildNodes

    {

        CurrentState[FaultNum+1] = i

        Invoke RootedStateTreeConstructor() with following inputs:

            CurrentNode.ArrayOfChildNodes[i]
            SumTillNow+i
            FaultNum+1
            StateIndex
            CostVector
            NumLevels
            CurrentState

    }

}

Algorithm for Rooted Tree Construction 
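
The recursive construction above can be sketched in Python as follows. This is a hedged illustration, not the original implementation: it assumes 0-based quantization levels whose values sum to `num_levels` for every complete state, replaces the by-reference StateIndex counter with the current length of the cost vector, and uses a made-up `terminal_cost` function. The names `build_tree` and `terminal_cost` are hypothetical.

```python
from math import comb


class RootedTreeNode:
    """Tree node; leaves carry an index into the cost vector."""

    def __init__(self):
        self.children = []                  # ArrayOfChildNodes
        self.index_into_cost_vector = None  # set only at leaves


def build_tree(node, sum_till_now, fault_num, num_faults, num_levels,
               current_state, cost_vector, terminal_cost):
    """Recursively enumerate quantized states and fill in the terminal costs.

    current_state[k] is the quantization level (0..num_levels) of fault k.
    """
    if fault_num == num_faults - 1:
        # Assumption: the last level is fixed by the requirement that levels sum to num_levels.
        current_state[fault_num] = num_levels - sum_till_now
        node.index_into_cost_vector = len(cost_vector)
        cost_vector.append({"cost_to_go": terminal_cost(current_state), "policy": None})
        return
    for level in range(num_levels - sum_till_now + 1):
        current_state[fault_num] = level
        child = RootedTreeNode()
        node.children.append(child)
        build_tree(child, sum_till_now + level, fault_num + 1, num_faults,
                   num_levels, current_state, cost_vector, terminal_cost)


num_faults, num_levels = 3, 4
root, cost_vector = RootedTreeNode(), []
build_tree(root, 0, 0, num_faults, num_levels, [0] * num_faults, cost_vector,
           terminal_cost=lambda state: 1.0 - max(state) / num_levels)

# Under the assumptions above, the number of quantized states is C(L + m - 1, m - 1).
expected_states = comb(num_levels + num_faults - 1, num_faults - 1)
```

Under these assumptions the cost vector grows combinatorially with the number of faults and quantization levels, which gives a feel for why the quantized DP recursion is expensive.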


Procedure GetCostAndPolicyForQuantizedState
Inputs:

RootedTreeNode CurrentNode By Reference
CostVectorArray CostVector By Reference
Integer NumFaults By Value
IntegerArray CurrentState By Reference

Outputs:

Cost
Policy

{

    (CurrentState holds the quantized state vector; follow one child pointer per fault down to the leaf)

    for i=1 to NumFaults-1

    {

        CurrentNode = CurrentNode.ArrayOfChildNodes[CurrentState[i]]

    }

    StateIndex = CurrentNode.IndexIntoCostVector

    Cost = CostVector[StateIndex].EstimateOfCostToGo

    Policy = CostVector[StateIndex].Policy

}

Algorithm for Accessing Cost and Policy of a Quantized State 
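
The lookup above amounts to following one child pointer per fault and then reading a single cost vector entry. The sketch below illustrates this in Python under the same kind of assumptions as before: 0-based quantization levels, a small helper `build` that constructs the index tree, and a placeholder cost vector filled with made-up values. The names `Node`, `build`, and `get_cost_and_policy` are hypothetical.

```python
class Node:
    """Tree node; leaves carry an index into the cost vector."""

    def __init__(self):
        self.children = []
        self.index_into_cost_vector = None


def build(node, remaining, depth, num_faults, counter):
    """Helper (hypothetical): build the index tree for levels that sum to a fixed total."""
    if depth == num_faults - 1:
        node.index_into_cost_vector = counter[0]
        counter[0] += 1
        return
    for level in range(remaining + 1):
        child = Node()
        node.children.append(child)
        build(child, remaining - level, depth + 1, num_faults, counter)


def get_cost_and_policy(root, cost_vector, num_faults, quantized_state):
    """Descend one level per fault (all but the last), then read the leaf's entry."""
    node = root
    for i in range(num_faults - 1):
        node = node.children[quantized_state[i]]
    return cost_vector[node.index_into_cost_vector]


num_faults, num_levels = 3, 2
root, counter = Node(), [0]
build(root, num_levels, 0, num_faults, counter)

# Placeholder cost vector of (cost-to-go, policy) pairs with made-up values.
cost_vector = [(float(k), k % 2) for k in range(counter[0])]
cost, policy = get_cost_and_policy(root, cost_vector, num_faults, [1, 0, 1])
```

Each lookup visits NumFaults-1 nodes regardless of how many quantized states exist, so access cost grows with the number of faults and not with the size of the cost vector.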